FOREWORD


Realizing  the  need  for  a  publication  to  encourage  further 
scientific  approach  to  the  solution  of  many  traffic  problems,  the 
Eno  Foundation  is  pleased  to  present  this  methodical  discussion 
of  some  statistical  theories  and  their  application  in  the  analysis 
of  traffic  data. 

The  Foundation  was  fortunate  in  acquiring  the  services  of 
Dr.  Bruce  D.  Greenshields,  Professor  and  Executive  Officer,  Civil 
Engineering  Department,  and  Dr.  Frank  M.  Weida,  Executive 
Officer, Department of Statistics, The George Washington University,
as co-authors. By knowledge and experience they are eminently
qualified.  They  have  been  guided  by  a  practical  insight  and  have 
shown  an  unusual  and  necessary  discernment  of  the  subject. 

In  some  quarters,  thinking  on  traffic  as  a  national  problem  has 
reached  a  degree  of  desperation.  This  is  due  partly  to  confusion. 
It is hoped this study will provide some clarification by emphasizing
the importance of an analytical basis for initiating
logical  improvements.  Such  procedure  should  tend  to  create  better 
understanding  and  much-needed  uniform  basic  methods. 

It has been a privilege for the Eno Foundation to provide for the
preparation and publication of this monograph. Publication has
resulted  from  considerable  time  and  effort  by  both  authors  and 
the  Foundation  Staff. 

The  Eno  Foundation 
PREFACE


The  engineer,  and  particularly  the  traffic  engineer  working  in  a 
comparatively  new  field,  faces  constantly  the  need  for  new,  more 
precise  information.  To  obtain  this  information,  he  collects  and 
analyzes  data.  The  theory  and  procedures  to  be  followed  in  such 
analyses  have  long  been  known  to  the  statistician,  but  not  always 
to  the  engineer. 

The mathematics he learns for his engineering is of the classical type
(algebra, trigonometry, calculus), in which exact answers are
obtained. In statistics no answer is exact, for there is always a
range of variability within which the true answer lies. Variance,
the  measure  of  this  variability,  may  in  some  cases  be  so  small 
that  the  result  for  practical  purposes  may  be  considered  exact. 
But  usually  it  is  not.  In  traffic  behavior,  a  phase  of  human  behavior, 
it  is  well  to  employ  the  "mathematics  of  human  welfare." 

Traffic  research  carried  on  at  various  times  over  a  period  of  years 
by  one  of  the  writers  has  served  to  confirm  the  fact  that  traffic 
behavior  tends  to  follow  definite  statistical  patterns.  The  difficulty 
of  solving  the  problems  encountered  in  analyzing  the  data  collected 
during  that  research  pointed  to  the  need  for  someone  to  gather 
together  and  explain  the  statistical  methods  most  pertinent  to 
traffic  analyses. 

In response to this need, this monograph is written. The desired
information, it was felt, could be assembled, developed, and presented
most effectively by a traffic engineer and a statistician working
together.  The  one  would  know  the  viewpoint  of  the  engineer  and 
the  limitation  of  his  statistical  training  and  vocabulary.  The  other 
would  provide  that  knowledge  and  skill  in  his  own  field  that  can 
be  obtained  only  after  years  of  work  and  study. 

The authors, despite the work involved, have enjoyed what seemed
to them a very worthwhile undertaking. This monograph is not
in any sense the last word on the subject. It is merely an
introduction, which they hope will assist the engineer in determining
the type and amount of data he needs to obtain sufficiently
accurate answers to his problems and save him time and effort.
They trust that if it is a new tool to him it will be to his liking.

In  the  first  four  chapters  the  authors  have  attempted  to  explain 
this  mathematical  tool,  and  in  the  last  one  they  have  attempted  to 
show  how  to  use  it. 

The  authors  wish  to  thank  the  Eno  Foundation  and  staff  for  its 
kindly  criticism,  good  counsel,  encouragement  and  sponsorship. 
They  are  indebted  to  Professor  Herman  Betz  of  the  Department 
of  Mathematics  at  the  University  of  Missouri  for  his  careful  review 
of  the  manuscript. 

Washington, D. C.                              Bruce D. Greenshields
June 1, 1952                                   Frank M. Weida
CHAPTER I

THE  NATURE  AND  UTILITY  OF  STATISTICS 


I. 1. General Remarks. The rapid movement of traffic on our streets
and highways in ever-changing patterns is one of the most familiar
and beneficial phenomena of our daily lives and at the same time one
of the most confusing and vexing. The annoyances and even danger
experienced in driving over congested streets and highways, the
lack of places to park and, in general, the inadequacies of our
highway system are widely recognized. There is clearly a need for
increased knowledge of traffic behavior in order that traffic
regulation and planning may be made more scientific. The method by
which scientific knowledge is increased is to observe what happens
and then by inductive reasoning to establish general laws pertaining
to these happenings. It is the purpose of this book to develop
a scientific system known as Statistical Methods and show how to
use these methods for analyzing and solving traffic problems.

Mathematical  probability,  which  is  the  basis  of  all  statistical 
theory,  had  its  beginning  in  ancient  times.  Certain  mathematical 
patterns  developed  as  pastimes  by  the  Greeks  and  others  were  first 
found  to  coincide  with  chance  happenings  such  as  occur  in  card 
games  and  later  found  to  coincide  with  actual  happenings.  It  was 
not  until  the  Seventeenth  Century  that  one  of  the  first  practical 
uses  was  made  of  probability,  when  life  expectancy  tables  were 
published for use in computing life insurance premiums and
benefits. Among the early important contributors to the theory of
probability we find the names of De Moivre, Laplace, Gauss, Pascal,
Fermat and Bernoulli.

The methods of statistics have long been employed by the
chemist, the sociologist, the physicist, the biologist, the
bacteriologist, the physiologist, the economist, the meteorologist, the
business man, the psychologist, and many others. In the biological
sciences, the whole theory of evolution and heredity rests in reality
on a statistical basis. Likewise, the behavior of the body mechanism
lends itself to statistical analysis. Statistical theory is the
basis of various aspects of theoretical physics and chemistry as
demonstrated by Gibbs, Bohr, Einstein, Fermi, Dirac and others. In
the social sciences, statistics is used in the measurement of the
size of the population, the birth, marriage, mortality and
morbidity rates, and in determining the distribution of the population by
trade or income, wages, prices, production, foreign trade, and
transportation. In manufacturing, statistics facilitates efficient
management, economic control of the quality of manufactured
products, and the evaluation of laws of behavior to determine
control or lack of control. Statistics is the basis of corrective
legislation. But in spite of this widespread use, it is only within the
last few years that the traffic engineer has come to realize that
statistics is his most useful tool.¹ The traffic engineer should fully
realize the importance of the statistical approach to the solution
of his problems. If there has been some failure on his part to do so,
it no doubt is due to the omission of statistics from his engineering
training, in which he has been taught to assume that the values with
which he is dealing are exact and always the same. Each individual
piece of material of a given kind and size is assumed to behave the
same as any other piece of the same kind and size. Statistics deals with
measurements which at best are approximate values and which are
usually not the same when repeated. In traffic engineering, the
individuals are human and it cannot be assumed that they will
always behave in precisely the same manner.

The  automobile  does  not  become  a  complete  mechanism  until 
the  driver  is  behind  the  wheel.  It  is  the  driver  who  sees  the  curve 
ahead and turns the steering wheel accordingly, who sees the
obstruction and applies the brakes. It is the emotional and physical
characteristics of the driver that must be measured and evaluated.
To this end, the traffic engineer must use the special type of
mathematics that applies to the problem he is considering.

In  this  attempt  to  make  statistics  more  readily  available  to  the 
traffic engineer and others, an effort will be made not only to
explain statistical methods, but to show by example how they may
be used in the solution of traffic problems. An understanding of the
calculus is desirable but not essential for use of the methods
involved. In using statistics it must be kept in mind that it is the
handmaiden of reality and not reality itself. In all cases it must be
demonstrated that the statistical law of behavior to be used agrees
with actual behavior.

As  the  statistical  methods  are  developed,  it  will  be  found  that 
they  constitute  a  unified  structure.  This  will  become  apparent  as 
the  development  is  followed  step  by  step.  The  first  step  will  be  to 
explain  statistical  terms  through  the  derivation  and  explanation  of 
the mathematical and statistical probability formulas which form
the basis of statistics. The use of these formulas will become clear
through  their  application  to  the  solution  of  typical  problems. 

I. 2. Definition and Nature of Statistics. Statistics is the
fundamental and most important part of inductive logic. It is both an art
and a science, and it deals with the collection, the tabulation, the
analysis and interpretation of quantitative and qualitative
measurements. It is concerned with the classifying and determining of
actual attributes as well as the making of estimates and the testing
of various hypotheses by which probable, or expected, values are
obtained. It is one of the means of carrying on scientific research
in order to ascertain the laws of behavior of things, be they animate
or inanimate. Statistics is the technique of the Scientific Method.

I.  3.  Statistics  and  Mathematics.  Statistics  is  a  branch  of  applied 
mathematics.  It  differs  from  so-called  pure  mathematics  in  that  the 
values  in  statistics  are  approximations  or  estimates,  but  not  mere 
guesses. The rules and methods of operation are those of pure
mathematics, for pure mathematics is the tool of statistical analysis.

An  "exact"  value  in  pure  mathematics  may  be  thought  of  as 
one  of  the  possible  values  a  variable  may  assume.  There  are  but 
two  possibilities  in  pure  mathematics,  namely:  the  variable  has  a 
certain  value  or  it  does  not  have  that  value.  In  the  first  case,  the 
probability  is  1,  meaning  that  it  is  certain  that  the  variable  has 
that value, while in the second case the probability is zero, meaning
that it is certain that the variable does not have that value.

The variable in statistics, called a stochastic variable or variate, is
much more general than the variable in pure mathematics. A
stochastic variable is one to each of whose many possible values
there is attached a probability, p, that it attains that value.
As will be shown in Chapter III, this probability may have any
value between zero and one. This fact is expressed mathematically
as 0 ≤ p ≤ 1.

The  stochastic  or  random  variable  may  be  discrete  or  continuous. 
It  is  called  discrete  if  it  can  take  on  only  certain  isolated  values  in 
an  interval  and  it  is  called  continuous  if  it  can  take  on  any  value 
in an interval. It is to be noted that the probability that a
continuous stochastic variable has a specific value is always zero.
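
The distinction can be illustrated with a short Python sketch. The figures are hypothetical and are not data from this monograph: a discrete variate (the number of vehicles arriving in an interval) with a probability attached to each value, and a continuous variate (a spot speed assumed, purely for illustration, to be uniformly distributed between 30 and 50 miles per hour) for which only intervals carry positive probability.

```python
from fractions import Fraction

# Hypothetical discrete variate: number of vehicles arriving in one interval.
# Each possible value carries a probability p with 0 <= p <= 1.
arrivals = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}


def is_valid_distribution(dist):
    """True if every probability lies in [0, 1] and the probabilities sum to 1."""
    return all(0 <= p <= 1 for p in dist.values()) and sum(dist.values()) == 1


def uniform_interval_probability(a, b, low=30.0, high=50.0):
    """P(a <= X <= b) for a continuous variate assumed uniform on [low, high]."""
    a, b = max(a, low), min(b, high)
    return max(b - a, 0.0) / (high - low)


print(is_valid_distribution(arrivals))
print(uniform_interval_probability(35.0, 40.0))  # an interval has positive probability
print(uniform_interval_probability(40.0, 40.0))  # one exact value has probability zero
```

The uniform law is chosen only because its interval probabilities are easy to check by hand; the distributions actually used for traffic data are taken up in Chapter III.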

I. 4. Two General Types of Problems. Statistics deals with problems
that  fall  into  two  general  categories. 

1. The first of these categories of problems has to do with
characterizing a given set of numerical measurements or estimates of
some attribute or set of attributes applying to an individual or a
given group of individuals. This entails the finding of a
mathematical model that fits the pattern of the variation in
measurements or the variation in the things being measured. The engineer
is familiar with the fact that a distance may be measured several
times with a different result each time, and he knows that the
mathematical pattern called "The Principle of Least Squares" is used
in characterizing such measurements. In studying some attribute
such as the ability of students, it is found that there are just as
many brighter than "average" as there are less bright. This
pattern is called "normal", and there is a mathematical equation
for such a normal curve. Other laws of behavior (distributions) are
found to follow other mathematical patterns, such as Poisson's
"random" curves (distributions), the Pearson system of
distributions, and others.

Fortunately,  these  mathematical  patterns  are  all  of  the  same 
basic  nature.  It  will  be  one  of  our  tasks  to  describe  and  explain 
this  phase  of  statistical  mathematics. 

2.  The  second  category  of  problems  has  to  do  with  characterizing 
an  attribute  or  attributes  belonging  to  all  individuals  of  the  group 
one  is  investigating,  such  as  all  white  pine  lumber  or  all  the  people 
living  in  Ponca  City,  all  people  with  red  hair,  or  all  aluminum 
alloys of a given specification. These well-defined classes of items
are called populations or "universes". This second class of problems
involves  the  selection  of  random  samples  from  the  population,  the 
statistical  study  of  these  samples,  and  the  drawing  of  inferences 
from  them. 

The  problems  just  mentioned  indicate  that  (1)  the  data  must  be 
summarized  as  will  be  discussed  in  Chapter  II;  (2)  they  must  be 
thoroughly  analyzed  by  obtaining  mathematical  patterns  of  the 
laws  of  their  behavior  as  will  be  discussed  in  Chapter  III;  and  (3) 
it  must  be  possible  to  draw  inferences  from  the  samples  in  regard 
to  the  reliability  and  significance  of  pertinent  summary  values 
obtained  from  the  samples  for  the  purpose  of  characterizing  the 
"universe"  as  will  be  discussed  in  Chapter  IV. 

I. 5. Types of Sampling. One may classify random sampling in two
ways: (1) sampling by attributes; and (2) sampling by variables,
either discrete or continuous. In sampling by attributes, one
determines the number of times (the frequency) the event happened as
specified and the number of times the event did not happen as
specified. In sampling by variables, we measure such things as the weight
or  length  of  an  object,  the  duration  of  an  event  or  the  intensity  of  a 
force.  We  may  also  measure  a  group  of  individuals  in  order  to 
characterize  them  in  regard  to  multiple  categories  such  as  weights, 
heights,  temperatures,  etc.,  to  be  considered  jointly.  The  basis  of 
all  such  characterizations  is  counting.  Hence  we  must  determine 
the  frequency  of  the  occurrence  of  a  characteristic  or  event  among 
n  possible  occurrences  or  non-occurrences  or  among  n  trials. 
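
Counting of this kind can be sketched in a few lines of Python. The observations below are hypothetical (whether each of ten observed drivers came to a full stop), and the function name is introduced here only for illustration.

```python
def attribute_frequency(observations):
    """Return (times the event happened, times it did not, relative frequency)."""
    n = len(observations)
    happened = sum(1 for outcome in observations if outcome)
    return happened, n - happened, happened / n


# Hypothetical sample of n = 10 trials: True means the event happened as specified.
stops = [True, False, True, True, False, True, True, True, False, True]
print(attribute_frequency(stops))
```

The relative frequency, here the number of occurrences divided by n, is the quantity on which the probability estimates of the later chapters are built.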

I. 6. The Variables to be Measured and Interpreted. The statistical or
scientific method applies not only to the analysis and
interpretation of data but to the whole procedure: first, recognizing the
need for increased knowledge about a particular problem; second,
gathering data about the problem; third, studying the
significance of the data; and finally, presenting the results of the
investigation in a report. In carrying out this statistical procedure
there are certain precautions that must be observed.

The recognition of the need for more information about a
particular problem usually comes from those who have to deal with it.
A research project conducted in Ohio in 1939⁴ will serve to illustrate
the  steps  in  conducting  an  investigation  to  obtain  certain  specific 
information.  This  study  had  to  do  with  center-line  markings  of 
roadways.  The  fact  that  different  states  had,  and  still  have, 
different  systems  of  markings,  causing  confusion  to  motorists, 
pointed  to  the  obvious  need  of  determining  the  best  type. 

The first question to be answered was: Is the problem solvable
by statistical methods? If so, what method or methods are
applicable, what variables need to be measured, how much data are
needed, and how best can the needed data be obtained?

In  the  problem  of  center-line  marking,  one  is  interested  in  the 
qualities  that  make  a  good  center-line  marking.  Some  such 
qualities are visibility, interpretability and durability. But what
about other things? Is a broken line just as satisfactory for a
center-line as a solid line? The broken line is cheaper because it
requires less paint. What kind of a line or lines should be used to
mark a "no-passing" zone? Such questions, of course, can only be
answered  after  the  study  is  made.  Hence  it  was  necessary  to  make 
a  provisional  conjecture  as  to  what  types  of  center-line  marking 
should  be  tested. 

I. 7. Means of Measuring the Variable, and Precautions to be Taken.
Once the types of center-lines to test had been provisionally decided,
the next step was to devise a means of measurement. Should it be
done by noting the behavior response pattern of drivers to different
types of markings? Should a speed check be made? Should drivers
be questioned? Should some other methods be used? What is the
probable cost and efficiency of the different possible methods?
What type of equipment is necessary to make the recordings?

It has been found by experience that it is sometimes necessary
to design and construct special equipment or apparatus to record
field data. It is recalled that in 1932² it was only after
considerable thought that the rather simple expedient of time-motion
pictures was used to record the speed and spacing of vehicles. A
mechanical device, provided it is first checked for mechanical
defects, is always more reliable than human judgment. The picture
method possessed one other feature that is not often attained. It
gave complete information on all that happened within the field of
view. The pertinent information could then be selected at leisure
and if a wrong conjecture was made, other information already in
hand could be studied.

It was decided in the 1939 project to take speed recordings with
the Eno-scope, a device using mirrors so arranged that the time at
which a vehicle passes two successive positions on the roadway
can be recorded by means of a stop watch. These positions must be
a considerable distance apart, usually 88 or 176 feet, so that the
human variation in snapping the watch will not cause an
appreciable error. Another source of error that is not so readily apparent
is the inability of the observer to take a random sample without
taking the proper precautions to obtain one. It would seem that if
the observer simply recorded the speed of as many vehicles as
possible it would result in an unbiased sample, but such is not the
case. Vehicles tend to bunch into queues behind the slower drivers.
Depending upon his alertness, the observer may be
unconsciously selecting slow or fast vehicles. He must instead select
vehicles by some arbitrary fixed rule, such as taking every third one.
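
The arithmetic behind the 88-foot and 176-foot spacings, and the rule of taking every third vehicle, can be sketched in Python. This is an illustration only: the function names and the stop-watch readings are hypothetical and are not taken from the Ohio study. A vehicle covering 88 feet in 2 seconds travels 44 feet per second, which is 30 miles per hour.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600


def speed_mph(distance_ft, elapsed_s):
    """Average speed in miles per hour over a measured distance."""
    return distance_ft / elapsed_s * SECONDS_PER_HOUR / FEET_PER_MILE


def every_kth(vehicles, k=3):
    """Take the k-th, 2k-th, 3k-th ... vehicle in order of arrival."""
    return vehicles[k - 1::k]


# A vehicle timed at 2.0 s over 88 ft and at 4.0 s over 176 ft: the same speed.
print(speed_mph(88, 2.0), speed_mph(176, 4.0))

# The same hypothetical 0.2 s slip of the stop watch matters less over the longer base.
print(round(speed_mph(88, 2.2), 1), round(speed_mph(176, 4.2), 1))

# Every third vehicle out of ten, numbered in order of arrival.
print(every_kth(list(range(1, 11))))
```

Doubling the distance roughly halves the speed error caused by a given timing slip, which is why the positions are set a considerable distance apart.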

This device is not infallible either. Suppose, for instance, that an
origin-destination survey is being conducted to determine the
travel routes of people living in different sections of a city, and
that it has been decided to interview every tenth house starting
from an arbitrary point. But would we be correct in assuming that
every tenth house constitutes a good random sample? It could be
that every tenth house is a corner house and hence may be a shop
of some kind. In this case, some special procedure must be used,
such as writing the house numbers on cards and, after shuffling, picking
every tenth card.
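
The card procedure has a direct counterpart in Python's standard `random` module. The sketch below is a hypothetical illustration: the 200 house numbers and the function name are assumptions, and a seed is passed only so that a run can be repeated.

```python
import random


def shuffled_every_tenth(house_numbers, seed=None):
    """Shuffle the 'cards', then pick every tenth card from the shuffled deck."""
    rng = random.Random(seed)
    cards = list(house_numbers)
    rng.shuffle(cards)
    return cards[9::10]


# Hypothetical street of 200 houses numbered 1 to 200.
sample = shuffled_every_tenth(range(1, 201), seed=1)
print(len(sample), sorted(sample))
```

Because the order of the cards no longer follows the order of the houses along the street, a regular feature such as a corner lot at every tenth position cannot dominate the sample.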

I. 8. The Size of the Sample. The size of the sample is the quantity of
data needed to meet certain considerations. One of the
considerations is cost, another is time. These depend upon the decision as to
(1) the maximum error that will be tolerated and (2) the degree of
certainty demanded that this allowable or maximum error will not
be exceeded. Together, these two decisions determine the size of the
sample, that is, the amount of data to be collected. The method of
gathering the data is largely dependent upon the structure and
character of the "universe" from which the sample is taken.
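
How the two decisions fix the sample size can be sketched with one common formula, n = (z·sigma/E)^2. This formula is an assumption introduced here for illustration and is not taken from this section: it presumes that a mean (say, a mean speed) is being estimated, that the standard deviation sigma is known roughly from a pilot count, and that the normal approximation applies. The figures are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, max_error, certainty):
    """Smallest n such that the sample mean is within max_error of the true
    mean with the stated certainty, under the normal approximation (assumed)."""
    z = NormalDist().inv_cdf(0.5 + certainty / 2.0)  # two-sided normal deviate
    return math.ceil((z * sigma / max_error) ** 2)


# Hypothetical speed study: sigma about 6 mph, tolerated error 1 mph.
print(sample_size_for_mean(6.0, 1.0, 0.95))
print(sample_size_for_mean(6.0, 1.0, 0.99))  # more certainty demands more data
print(sample_size_for_mean(6.0, 2.0, 0.95))  # a looser tolerance demands less
```

Halving the tolerated error roughly quadruples the required sample, which is why the allowable error has so direct a bearing on cost and time.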

In the Ohio study of 1939, it was desired among other things to
get the opinions of drivers about center-lines. Did they prefer a
yellow line, a white one, a broken line, or a solid line? The obvious
procedure was, of course, to stop motorists and ask their opinions.
But how many? Would the majority of 30 or 40 people agreeing on
one combination as being the best be sufficient? At first one might
possibly say yes, but on second thought he would realize that all
opinions might not be unbiased. Perhaps the drivers from
Pennsylvania had grown accustomed to a certain combination and would
prefer that, or the drivers from Ohio might prefer a different
system. This possible tendency to biased opinions meant that a
larger sample should be taken and also that, along with the
opinions, the residence of the driver should be ascertained.

Sometimes  opinions  are  unconsciously  biased.  This  fact  also  was 
brought  out  in  the  Ohio  study.  It  was  decided  to  try  road  signs 
worded  to  warn  drivers  that  they  were  entering  a  "no-passing" 
zone.  It  was  doubted  that  a  large  percentage  of  the  motorists 
would  see  the  signs,  but  surprisingly  enough,  over  98  percent  of 
them  stated  they  had  seen  the  signs.  This  was  so  unexpected  that 
it  was  questionable,  and  a  way  of  checking  these  answers  was  sought. 

The means of checking was revealed through consideration of
the purpose of the sign. Signs, aside from those whose shape conveys
a message, must be read. A sign much larger than the "no-passing"
sign was prominently displayed to warn the drivers that they were
entering a "test-zone". This might also have been guessed from the
fact that they had seen 3 or 4 different types of marking within a
mile or so, yet over one-third, when questioned, said they did not
know they were in a "test-zone". The conclusion reached was that
at least one-third and probably more did not see the "no-passing"
signs in spite of the fact that 98 percent said they had.

I. 9. The Validity and Reliability of Measurement. It is not only
opinion measurements that must be checked for validity. In a study of
brake-reaction-time made in Ohio in 1934³, it was decided to
determine whether the facts warranted the assumption that those
with quick reaction-time were safer drivers. It was perhaps
perfectly logical to assume that a quick reaction will enable a driver
to avoid accidents, but the study showed no relationship of
accidents to brake-reaction-time. If this is true, and other
investigations have shown that it is, then we may infer that an individual
with a slow reaction-time employs a larger margin of safety and so
compensates for his shortcoming. In other words, brake-reaction-time
is not a valid measurement to determine whether a driver is
a safe driver or not, since it does not in fact measure what it was
supposed to measure.

A  measurement  is  reliable  if  there  is  consistency  in  obtaining  it. 
In other words, consistency in measurements increases our
confidence in the reliability of the conclusion we wish to draw from
the set of measurements.

I. 10. Cost of the Project. After the amount of data needed to obtain
results accurate to the degree desired has been estimated, the
apparatus needed has been decided upon and the procedure outlined, it
is possible to estimate the minimum cost. This cost will depend to
a large extent on the number of personnel needed and the time
required to complete the study. The cost of development research is
easier to estimate than that of basic or fundamental research. In
the former we know much more about the expected results.
Development research follows the fundamental. It is often used to
verify results that have been suggested by more basic studies. In
any case, however, it is necessary to estimate the cost. The skill of
the researcher is rightly or wrongly measured by his ability to
estimate correctly the cost and effort required to carry on an
investigation to the point where definite results, whether positive or
negative, are obtained and reported.

I. 11. The Report. A preconceived idea or system of thinking must
not be allowed to influence the reporting of results. A negative
result is just as important as a positive one. Too often an
investigation is conducted to prove a point, and this attempt to adhere
to an established opinion may have undue influence in selecting
the attribute to measure.

The results of a scientific investigation should be presented with
the same care that was used in conducting the survey. All too often,
information is brought to light only to lose its value through poor
presentation. Knowledge is useful only as it becomes known.
Fortunately there has been developed a recognized style of engineering
reports and several good books on the subject are available.⁵ It
should be emphasized that the writing of the report should be
considered a part of any scientific investigation, and a most important
part.

I. 12. Purpose of the Book. Having indicated the general procedure,
and noted some of the precautions that need to be taken, we shall
now attempt to discuss the necessary theory and outline the
techniques for the solution of traffic problems. Finally we shall attempt the
solution or partial solution of some of the more typical problems.

Chapter II presents the method of summarizing data and
obtaining summary numbers that are useful for the analysis,
characterization and interpretation of one or more sets of
measurements.

Chapter III presents the theory and basis of the various
mathematical patterns (laws of behavior) that are the underlying
principles upon which the analysis and interpretation of the results
depend.

Chapter  IV  shows  the  use  of  summary  methods  of  Chapter  II 
and  the  basic  theory  of  Chapter  III  to  solve  problems  by  statistical 
methods  and  to  ascertain  the  reliability,  validity,  significance,  and 
meaning  of  the  solution. 

Chapter  V  outlines  the  solution  or  partial  solution  of  some 
typical  as  well  as  some  of  the  more  unusual  traffic  problems. 


REFERENCES,  CHAPTER  I 

1 Kinzer, John P., "Application of the Theory of Probability to Problems
of Highway Traffic," Proceedings, Institute of Traffic Engineers, 1934,
pages 118-123.

Adams, W. F., "Road Traffic Considered as a Random Series," Institution
of Civil Engineers Journal, November 1936, pages 121-130.


Greenshields, Bruce D., "Initial Traffic Interference," presented for
discussion at the 16th Annual Meeting of the Highway Research Board,
November 19, 1936, Washington, D. C., 9 pages, mimeographed, with
comments by W. F. Adams.

2 Greenshields, Bruce D., "The Photographic Method of Studying Traffic
Behavior," Proceedings, Highway Research Board, Washington, D. C., 1933,
pages 384-399.

Ibid., Schapiro, Donald, and Ericksen, Elroy L., "Traffic Performance
at Urban Street Intersections," Yale Bureau of Highway Traffic, New Haven,
Connecticut, 1947, pages 73-118.

3 Ibid., "Reaction Time in Automobile Driving," Journal of Applied
Psychology, Vol. XIX, No. 3, June 1936, pages 353-358.

4  Report  of  Highway  Research  Board  Project  Committee  on  "Markings 
for  No-Passing  Zones,"  November  1939. 

5  Nelson,  J.  Raleigh,  "Writing  The  Technical  Report"  McGraw-Hill  Book 
Co.,  1947. 


CHAPTER  II 

SUMMARIZING  OF  DATA 


II. 1. Objective. After the data have been collected, it is not only
convenient but necessary that they be condensed in order to be
analyzed and interpreted by means of summary numbers which
serve to characterize the data. Some summary numbers are averages,
and included among them are the mean, the median, the mode, and
the standard deviation.

This  chapter  shows  how  to  summarize  data  both  analytically 
and  graphically.  The  procedures  will  be  made  clear  by  examples. 

II. 2. Frequency Distribution. A frequency distribution constitutes
the first step in classifying and condensing data. It is an arrangement
in which the data consisting of separate values or measurements
of a variable are combined into groups called classes covering a
limited range of values, such as 1 to 5 miles, 5 to 10 miles, etc. The
number of values in each class is called the class frequency. Once
the observations have been combined into groups, the individual
items lose their identity and the midpoint of the class group
becomes a unit quantity with a broader meaning. This requires that
the grouping be done in such a way that it will accurately
represent the items from which it is computed. The methods to be
followed will become clear with an examination of the
construction of a frequency table.

II. 3. Class Interval and Class Mark. A class interval sets boundaries
or limits to a class of a frequency distribution. In Table II. 1., the
lower bounds of the classes are 15, 20, ...; the upper bounds are
19, 24, 29, ...; the lower boundaries or limits are 14.5, 19.5, ...;
the upper limits or boundaries are 19.5, 24.5, .... The class interval
is 5. By the laws of approximate numbers, the data have been
rounded off to the nearest whole number so that the speeds are
correct to the nearest mile per hour.
Table II. 1

SPEED IN MILES PER HOUR OF FREE MOVING VEHICLES ON SEPTEMBER 16, 1939,
IN OAKLAWN, ILLINOIS ON U.S.H. 12 AND 20 AT A POINT ONE MILE EAST OF
HARLEM AVENUE

(1)            (2)        (3)        (4)        (5)         (6)         (7)
Speed       Number   Smoothed   Per Cent   Relative  Cumulative  Cumulative
in              of  Frequency         of  Frequency   Frequency    Per Cent
m.p.h.    Vehicles              Vehicles                          Frequency
s                f         fs    100 f/n        f/n          fc    100 fc/n

70-74            0          0          0          0
65-69            0        0.7          0          0
60-64            2        5.7       0.67     0.0067         300      100.00
55-59           15       10.3       5.00     0.0500         298       99.33
50-54           14       19.3       4.67     0.0467         283       94.33
45-49           29       39.0       9.67     0.0967         269       89.67
40-44           74       54.3      24.67     0.2467         240       80.00
35-39           60       65.7      20.00     0.2000         166       55.33
30-34           63       50.7      21.00     0.2100         106       35.33
25-29           29       32.7       9.67     0.0967          43       14.33
20-24            6       14.3       2.00     0.0200          14        4.67
15-19            8        4.7       2.67     0.0267           8        2.67
10-14            0        2.7          0          0           0        0.00

Total      300 = n  300.1 = n     100.02     1.0002

Data  furnished  by  Public  Roads  Administration,  Washington,  D.  C. 

Note:  This  illustration  is  of  a  continuous  stochastic  variable  which  may  take  any  value.  An 
illustration  of  a  discontinuous  variable  is  the  numbers  of  vehicles  that  pass  over  a  highway  in 
any  time  interval.  There  is  no  such  thing  as  a  part  of  a  vehicle.  An  illustration  of  a  discontinuous 
stochastic  variable  where  only  even  integers  are  possible  is  the  distribution  of  rows  of  kernels  on 
ears  of  corn. 


A class mark is the mid-value of the class interval. In Table II. 1.,
column (1), the class marks are 17, 22, 27, ....

The  exact  values  of  a  discontinuous  variable  are  usually  taken 
equal  to  the  class  marks.  For  many  purposes,  all  the  values  of  a 
continuous  variable  that  fall  within  a  given  class  interval  are 
grouped  at  the  class  mark  as  a  convenient  approximation. 

The  number  of  values  that  the  variable  has  within  a  certain  class 
interval  is  called  a  class  frequency.  In  Table  II.  1.  the  frequency 
63  in  column  (2)  corresponds  to  the  class  30-34  in  column  (1). 
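
The relations among class bounds, class boundaries, and class marks
described above can be checked numerically. The following is a minimal
Python sketch, assuming whole-number speeds and the ten classes of
Table II. 1. that contain observations; the variable names are
illustrative and do not come from the text.

```python
# Classes of Table II. 1. that contain observations: 15-19, 20-24, ..., 60-64.
width = 5                                    # class interval, in m.p.h.
lower_bounds = list(range(15, 65, width))    # 15, 20, ..., 60
upper_bounds = [lo + width - 1 for lo in lower_bounds]    # 19, 24, ..., 64

# Speeds are rounded to the nearest mile per hour, so each class really
# extends half a unit beyond its stated bounds.
lower_boundaries = [lo - 0.5 for lo in lower_bounds]      # 14.5, 19.5, ...
upper_boundaries = [hi + 0.5 for hi in upper_bounds]      # 19.5, 24.5, ...

# The class mark is the mid-value of the class interval.
class_marks = [(lo + hi) // 2 for lo, hi in zip(lower_bounds, upper_bounds)]

for lo, hi, mark in zip(lower_bounds, upper_bounds, class_marks):
    print(f"{lo}-{hi}: class mark {mark}")
```

Each upper boundary coincides with the lower boundary of the next
class, which is why the boundaries, rather than the bounds, are used
when the variable is treated as continuous.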
Two conditions which serve as a guide in the choice of the size of
a class interval are: (a) the desire to be able to treat all the values
assigned to any one class, without appreciable error, as if they
were equal to the mid-value or class mark of the class interval;
(b) for convenience and brevity, it is desirable to make the class
interval as large as possible, but always subject to the first
condition. These two conditions will in general be fulfilled if the
interval is so chosen that the number of classes lies between ten and
thirty. This does not mean, however, that the minimum may not
be less than ten classes nor the maximum more than thirty classes;
it merely means that in most cases it is possible to form the
classification with the number of intervals lying between ten and
thirty.

Figure II. 1
Frequency Rectangles of Observed Vehicle Speeds

Another  convenient  means  of  classification  is  the  graphical 
summary  method.  There  are  five  types  of  graphs  that  have  been 
found  useful:  namely,  the  Frequency  Rectangles,  the  Histogram, 
the  Frequency  Polygon,  the  Smoothed  Frequency  Polygon,  and  the 
Frequency  Curve.  We  shall  now  discuss  these  in  the  order  named. 
Figure II. 2
Histogram of Observed Vehicle Speeds


II. 4. Frequency Rectangles. Using the frequency distribution as given
by columns (1) and (2) in Table II. 1., the rectangles shown in
Figure II. 1. may be drawn. The class intervals are the bases and
the altitudes (ordinates) are equal to the frequencies of the classes.
Unit area is defined as that of a rectangle whose base is a class
interval and whose altitude is a unit of frequency. This gives a
one-to-one correspondence between area and frequency. In other
words, since the base is equal to one (class interval), the height is
the frequency.

Figure II. 3
Frequency Polygon of Observed Vehicle Speeds

II. 5. Histogram. A histogram is the system of upper bases of the
frequency rectangles. It is illustrated in Figure II. 2. for the
frequency distribution given by columns (1) and (2) of Table II. 1.

II. 6. Frequency Polygon. A frequency polygon is formed by
selecting a convenient horizontal scale for the variable being measured
and a vertical scale for the class frequency and then plotting the
points so that the class marks are the abscissas and the class
frequencies are the ordinates. This method is shown in Figure II. 3.
for the distribution given in Table II. 1.

II. 7. Smoothed Frequency Polygon. The smoothed frequency polygon
is a means of graduation sometimes called a method of moving
averages. It is useful in obtaining an approximation to the probable
frequency curve or theoretical law of behavior of the attribute
that is being measured.

One  method  of  obtaining  moving  averages  is  illustrated  in 
Columns  (1),  (2),  (3),  in  Table  II.  1.,  in  which  the  smoothed  value 
for  an  interval  is  obtained  by  summing  the  frequencies  in  that 
interval  and  the  two  adjacent  intervals  and  dividing  by  three. 
Hence,  the  smoothed  value  for  the  interval  15-19  is  equal  to  the 
sum  of  the  frequencies  0,  8,  and  6,  divided  by  3.  For  the  interval 
20-24,  we  add  the  frequencies  8,  6,  and  29,  and  divide  the  sum  by 
3.  We  proceed  likewise  for  the  remaining  intervals.  The  smoothed 
frequency  polygon  for  the  distribution  given  in  columns  (1)  and 
(3)  of  Table  II.  1.  is  shown  in  Figure  II.  4.  By  comparing  Figure 
II.  4  with  Figure  II.  3.,  it  is  seen  that  the  smoothed  frequency 
polygon  has  removed  the  irregularities  found  in  Figure  II.  3.  and 
is  closer,  in  appearance,  to  a  frequency  curve.  See  definition  of 
frequency  curve,  Article  II.  8. 

The  number  of  classes  over  which  an  average  is  taken  does  not 
need  to  be  three.  The  decision  as  to  the  number  of  classes  that 
should  be  taken  depends  upon  the  total  frequency,  the  total 
number  of  classes  in  the  distribution,  the  size  of  the  class  interval, 
the  equality  or  inequality  of  the  classes,  and  the  experimental 
error,  the  discussion  of  which  is  beyond  the  scope  of  this  book.  The 
process  of  smoothing  tends  to  correct  for  sampling  errors,  grouping 
errors,  and  experimental  errors. 
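
The three-class moving average described above can be reproduced from
column (2) of Table II. 1. The sketch below is illustrative only: the
function name `smooth` and its `span` parameter are hypothetical, and
it assumes that classes lying beyond the ends of the table have zero
frequency.

```python
# Column (2) of Table II. 1., listed from the class 10-14 up to 70-74.
frequencies = [0, 8, 6, 29, 63, 60, 74, 29, 14, 15, 2, 0, 0]

def smooth(freqs, span=3):
    """Moving average over `span` adjacent classes (span must be odd)."""
    half = span // 2
    # Assumption: classes beyond either end of the table are empty.
    padded = [0] * half + list(freqs) + [0] * half
    return [sum(padded[i:i + span]) / span for i in range(len(freqs))]

smoothed = [round(value, 1) for value in smooth(frequencies)]

for f, fs in zip(frequencies, smoothed):
    print(f"observed {f:3d}   smoothed {fs:5.1f}")
```

Passing `span=5` averages over five classes instead of three, which
smooths more heavily at the cost of flattening the peak.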

An important point to note is that the total area within the
rectangles, the histogram, the frequency polygon, the smoothed
frequency polygon, and within the frequency curve is equal to the
total frequency n. This total frequency in terms of probability is
thought of as one and in terms of per cent as 100 per cent. The
height of the frequency rectangles is then expressed as a fraction
or a per cent.

Figure II. 4
Smoothed Frequency Polygon of Observed Vehicle Speeds


II. 8. Frequency Curve. A smooth curve superimposed upon the
frequency polygon or smoothed frequency polygon so that the area
under it is equal to the total frequency is known as a frequency
curve. The frequency curve is an estimate of the limit that would be
approached by a frequency polygon or a smoothed frequency
polygon if we indefinitely decreased the size of the class intervals
and at the same time indefinitely increased the frequency n. An
illustration of a frequency curve for the distribution given in
Table II. 1. is given in Figure II. 5., where the points of the
smoothed frequency polygon have been used.

Figure II. 5
Frequency Curve of Observed Vehicle Speeds

II. 9. Cumulative Frequencies. Another type of distribution can be
secured by the use of cumulative frequencies. These values are shown
in column (6), Table II. 1., and are obtained by successive adding
of the frequencies, beginning with the lowest interval. To
illustrate: starting with 8, add 6 to 8 and get 14; then 29 + 14, which
equals 43, and so on until 298 plus 2 equals 300 for the last
cumulative frequency which, of course, is the total number of cases.

The cumulative frequency distribution in the example given
shows how many vehicles had a speed below (or above) a given
speed. From columns (1) and (6) in Table II. 1., we find that
8 vehicles had a speed less than 19.5 miles per hour; 14 had a speed
less than 24.5 miles per hour; 43 had a speed less than 29.5 miles
per hour; and so on. In some cases the cumulative frequencies
expressed as per cents of the total frequency are more meaningful.
These per cents are given in column (7), Table II. 1. According to
column (7), 2.67 per cent of the vehicles have a speed less than
19.5 miles per hour, 4.67 per cent of the vehicles have a speed less
than 24.5 miles per hour, and so on.

To obtain the graph of the cumulative frequencies or the
cumulative per cent frequencies, the points are plotted with
cumulative values as ordinates and the upper limits of the corresponding
classes as abscissas.

The points then are connected with straight line segments
(polygon) or with a smooth curve. In either case the resulting
graph is called an ogive. The curve may be interpreted as
portraying a law of growth. If the cumulation is in the opposite direction,
we would obtain a law of negative growth. In the case given,
2 vehicles (0.67 per cent) have a speed greater than 59.5 miles per
hour; 17 vehicles (5.67 per cent) have a speed greater than 54.5
miles per hour; and so on. The ogive for both the absolute and
percentage scale is shown in Figure II. 6.
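
The cumulation in both directions can be sketched in a few lines of
Python. This is only an illustration built on columns (1) and (2) of
Table II. 1.; the variable names are assumptions, not notation from
the text.

```python
from itertools import accumulate

# Upper boundaries and frequencies of the classes 15-19 through 60-64.
upper_boundaries = [19.5, 24.5, 29.5, 34.5, 39.5, 44.5, 49.5, 54.5, 59.5, 64.5]
frequencies = [8, 6, 29, 63, 60, 74, 29, 14, 15, 2]
n = sum(frequencies)

# "Less than" cumulation: running total from the lowest class upward.
cumulative = list(accumulate(frequencies))
cumulative_pct = [round(100 * fc / n, 2) for fc in cumulative]

# Cumulation in the opposite direction: vehicles above each upper boundary.
above = [n - fc for fc in cumulative]

for x, fc, pct, up in zip(upper_boundaries, cumulative, cumulative_pct, above):
    print(f"{x} m.p.h.: {fc} below ({pct} per cent), {up} above")
```

Plotting `cumulative` (or `cumulative_pct`) against `upper_boundaries`
gives the points of the ogive.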

The  class  frequencies  may  also  be  expressed  as  per  cents  or 
relative  frequencies.  These  values  are  shown  in  columns  (4)  and  (5) 
of  Table  II.  1.  In  the  former  case,  the  total  area  has  been  made 
100  units  of  area  and  in  the  latter  case  the  total  area  has  been 
made  the  unit  of  area. 

If Y = f(X) is the equation of the frequency curve, then

$$\int_{X_1}^{X_2} Y \, dX$$

is the number of observations having a value between $X_1$ and $X_2$.
If A is the lower limit of possible values of the variable and B
is the upper limit, then the total area N, namely, the total
frequency, is

$$\int_{A}^{B} Y \, dX = N.$$

In terms of relative frequency or statistical probability, we have

$$\int_{A}^{B} Y \, dX = 1,$$

where the whole area under the frequency curve is taken as the
unit of area.

In the latter case, Y is called the probability density and Y dX
is called the probability element.

Figure II. 6
Cumulative Frequency Curve of Observed Vehicle Speeds
(horizontal axis: Speed in Miles Per Hour)

For the cumulative frequency distribution, in the theoretical
case in terms of probability, the expression

$$F(X) = \int_{A}^{X} Y \, dX$$

is known as the Distribution Function of Probability, where
F(A) = 0 and F(B) = 1 and $A \le X \le B$.

Frequency distributions are characterized by summary numbers
which often are those functions of the measurements known as
averages. These averages show the location of central tendencies (if
any) and serve as bases for evaluating differences between values
(dispersion) as well as skewness and flatness of the distribution.
They are also instrumental in isolating extreme or unusual values.

II.  10.  Average.  An  average  is  a  function  of  the  entire  group  of  values 
such  that  if  all  the  values  were  equal  to  one  another  it  would  equal  each 
one  of  the  group  of  equal  values. 

In  general,  the  values  or  measurements  are  unequal,  some  being 
larger  and  some  being  smaller  than  the  average. 

Of the many averages, those which are of most use and interest
to the statistician are first, the common averages including the
arithmetic mean, the median, the mode, the geometric mean, and the
harmonic mean; and second, the averages of differences including
the mean (average) deviation, the contraharmonic mean, the standard
deviation, and the moments.

II. 11. Arithmetic Mean. Graphically, the arithmetic mean is the
abscissa of the centroid of the total area under the frequency curve
or frequency polygon.

It is the point at which, if the whole area is considered to be
concentrated, the first moment of the total area will equal the sum of
the first moments of the components of area into which the total
area is divided.
From Figure II. 7., if $f_1, f_2, \ldots, f_k$ are component areas, if
$X_1, X_2, \ldots, X_k$ are their corresponding distances from the
Y-axis, and if $n = f_1 + f_2 + \cdots + f_k$ is the total area and
$\bar{X}$ is its distance from the Y-axis, then

$$n\bar{X} = f_1 X_1 + f_2 X_2 + \cdots + f_k X_k,$$

whence

$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_k X_k}{n} = \frac{\sum_{i=1}^{k} f_i X_i}{n}. \tag{II.11.1}$$

Figure II. 7
Arithmetic Mean of Observed Vehicle Speeds
(horizontal axis: Speed in Miles Per Hour)



Algebraically: The arithmetic mean is the sum of all the values
of the variable divided by the number of values. If $\bar{X}$ is the
arithmetic mean and $X_1, X_2, \ldots, X_n$ represent the values of the
variable X, then

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}. \tag{II.11.2}$$

To illustrate: Let the values of the variable X be 10, 13, 17, and
18. The arithmetic mean of these values is

$$\bar{X} = \frac{X_1 + X_2 + X_3 + X_4}{4} = \frac{\sum_{i=1}^{4} X_i}{4} = \frac{10 + 13 + 17 + 18}{4} = 14.5.$$

When certain values of the variable occur more than once, the
same notation may be used, namely:

$$\bar{X} = \frac{X_1 + X_1 + X_1 + X_2 + X_2 + X_3 + \cdots + X_k}{n}. \tag{II.11.3}$$

But another symbolic representation is more convenient. Let $f_i$
be the frequency or number of times the variable X has the value
$X_i$. The sum of the $f_i$ occurrences of the value $X_i$ is $f_i X_i$.
Let n be the sum of the $f_i$ where, say, there are k different values
of $X_i$ and hence of the $f_i$. This symbolic representation gives

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i} = \frac{\sum_{i=1}^{k} f_i X_i}{n}. \tag{II.11.4}$$

If in II. 11. 4., each $f_i = 1$ and k = n, the expression for $\bar{X}$ is the
same as that given in II. 11. 2.

If the class intervals are unequal in size, the computational
process may be simplified by making a simple translation. Let

$$x'_i = X_i - X_0 \tag{II.11.5}$$

where $X_0$ may be any convenient value whatsoever. In practice it
is best to use for $X_0$ the midpoint of the middle class if there is an
odd number of classes; if there is an even number of classes, use
the midpoint of a class as near the middle of the distribution as
possible.

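Substituting the value of $X_i$ as given in II. 11. 5. in equation
II. 11. 4., we have

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{n} = \frac{\sum_{i=1}^{k} f_i (x'_i + X_0)}{n} = \frac{\sum_{i=1}^{k} f_i x'_i}{n} + \frac{X_0 \sum_{i=1}^{k} f_i}{n}.$$

Since $\sum_{i=1}^{k} f_i = n$ and $\sum_{i=1}^{k} f_i / n = 1$,

$$\bar{X} = X_0 + \frac{\sum_{i=1}^{k} f_i x'_i}{n}. \tag{II.11.6}$$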

In the special case when all class intervals are equal, we may use
the linear transformation (translation and change of unit)

$$x_i = \frac{X_i - X_0}{c} \tag{II.11.7}$$

where c is the size of the class interval.

Using the value of $X_i$ from II. 11. 7. in II. 11. 4.,

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i (c x_i + X_0)}{n} = \frac{X_0 \sum_{i=1}^{k} f_i}{n} + \frac{c \sum_{i=1}^{k} f_i x_i}{n}.$$

This when simplified becomes

$$\bar{X} = X_0 + c \left( \frac{\sum_{i=1}^{k} f_i x_i}{n} \right). \tag{II.11.8}$$

To  illustrate  II.  11.  8.,  we  may  use  the  frequency  distribution 
given  in  table  II.  1. 
Table II. 2

SPEED IN MILES PER HOUR OF FREE MOVING VEHICLES ON SEPTEMBER 16,
1939, IN OAKLAWN, ILLINOIS ON U.S.H. 12 AND 20 AT A POINT ONE MILE
EAST OF HARLEM AVENUE

Speed in miles    Number of
per hour          Vehicles      S - S0     s = (S - S0)/c        fs
X = S                 f

70-74                 0           30              6               0
65-69                 0           25              5               0
60-64                 2           20              4               8
55-59                15           15              3              45
50-54                14           10              2              28
45-49                29            5              1              29
40-44                74            0              0               0
35-39                60           -5             -1             -60
30-34                63          -10             -2            -126
25-29                29          -15             -3             -87
20-24                 6          -20             -4             -24
15-19                 8          -25             -5             -40

Total               300                                        -227

Substituting in II. 11. 8. the necessary values from Table II. 2.,
we find

$$\bar{X} = X_0 + c \left( \frac{\sum_{i} f_i x_i}{n} \right)$$

becomes

$$\bar{X} = 42 + 5 \left( \frac{-227}{300} \right) = 38.2. \tag{II.11.9}$$

This  result  is  approximate  in  that  in  addition  to  its  possessing  a 
sampling  error  and  an  experimental  error,  it  possesses  a  grouping 
error.  These  errors  will  be  discussed  later. 
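
The coded computation of II. 11. 8. can be checked against the direct
formula II. 11. 4. The following Python sketch uses the class marks and
frequencies of Table II. 2.; the variable names are illustrative
assumptions, and the class marks stand in for the individual speeds, so
the grouping error mentioned above is still present.

```python
# Class marks and frequencies from Table II. 2. (classes 15-19 through 60-64).
class_marks = [17, 22, 27, 32, 37, 42, 47, 52, 57, 62]
frequencies = [8, 6, 29, 63, 60, 74, 29, 14, 15, 2]

X0 = 42   # arbitrary origin: the class mark of the class 40-44
c = 5     # size of the class interval

n = sum(frequencies)
coded = [(X - X0) // c for X in class_marks]             # -5, -4, ..., 4
sum_fx = sum(f * x for f, x in zip(frequencies, coded))

mean_coded = X0 + c * sum_fx / n                         # formula II. 11. 8.
mean_direct = sum(f * X for f, X in zip(frequencies, class_marks)) / n  # II. 11. 4.

print(n, sum_fx, round(mean_coded, 1), round(mean_direct, 1))
```

Both routes give the same mean; the coded form only keeps the
intermediate products small, which mattered for hand computation.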

This arithmetic mean speed of 38.2 miles per hour is the estimate
of the probable or expected speed of a vehicle at the highway point
observed. What we wish to know about the mean speed is first,
whether or not it is reliable and second, the range of speeds above
or below it. Is 38.2 miles per hour characteristic for all vehicles and
if so, to what extent? We are able, with measures of dispersion,
to find the answers to these questions. After doing this, we must
look for a rational explanation of the agreement between the
statistically obtained values and the actual facts; we must also
determine what these facts mean. Were different types of vehicles
observed or was the variety of speeds due to drivers with different
desires or different abilities in driving, or to some other cause?
This will be discussed and illustrated in Chapter IV.

II.  12.  Measure  of  Central  Tendency.  A  measure  of  central  tendency 
is  sometimes  thought  of  as  a  characterizing  or  descriptive  value,  a 
norm  or  a  typical  value.  It  is  always  an  average.  But  an  average  in 
itself  is  not  necessarily  a  measure  of  central  tendency.  For  this  to 
be  true,  the  average  must  agree  fairly  closely  with  all  of  the  values 
from  which  it  is  obtained. 

II. 13. Mathematical Expectation or Expected Value of a Variable.
The expected value of a particular value $X_i$ of the variable X is the
product of $X_i$ and the probability $p_i$ that X takes the value $X_i$. If
$E(X_i)$ denotes the expected value of $X_i$, then

$$E(X_i) = p_i X_i. \tag{II.13.1}$$

Since the expected value of a sum is the sum of the expected
values, it follows that the expected value E(X) of a variable X
which may assume a set of values $X_i$ (i = 1, 2, ..., n) with
corresponding probabilities $p_i$ (i = 1, 2, ..., n) is

$$E(X) = \sum_{i=1}^{n} p_i X_i. \tag{II.13.2}$$

II. 14. Deviation from Arithmetic Mean. An important
characterizing property of the arithmetic mean is that the algebraic sum of
the deviations of the values from the arithmetic mean is equal to
zero. This property is true for no other average.

To illustrate: Let it be required to find the mean weight of four
men, who weigh respectively 128, 140, 150, and 190 pounds. Their
arithmetic mean weight is

$$\bar{X} = \frac{128 + 140 + 150 + 190}{4} = 152 \text{ lbs.}$$
The differences between the individual weights of these four
men and their arithmetic mean weight are:

Weights      Algebraic Differences
   X            $X - \bar{X}$

  190               38
  150               -2
  140              -12
  128              -24

               Sum = 0

The above demonstration may be stated in the form of a Theorem:
The sum of the algebraic differences between the values of a
variable X and their arithmetic mean $\bar{X}$ is equal to zero.

Let $X_i$ (i = 1, 2, ..., k) be the values of the variable X, let $f_i$
(i = 1, 2, ..., k) be the corresponding frequencies, and let $\bar{X}$ be
the arithmetic mean. Then

$$\sum_{i=1}^{k} f_i (X_i - \bar{X}) = \sum_{i=1}^{k} f_i X_i - \bar{X} \sum_{i=1}^{k} f_i.$$

But

$$\sum_{i=1}^{k} f_i = n \quad \text{and} \quad \sum_{i=1}^{k} f_i X_i = n\bar{X}.$$

Hence

$$\sum_{i=1}^{k} f_i (X_i - \bar{X}) = n\bar{X} - n\bar{X} = 0.$$

This Theorem may be expressed in terms of mathematical
expectation as follows: The expected value $E\{X - E(X)\}$ of the
deviations of a variable from its expected value E(X) is zero, that is:

$$E\{X - E(X)\} = 0. \tag{II.14.1}$$
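
The zero-sum property is easy to verify for the four weights used
above. The short Python sketch below is illustrative only; it also
repeats the check with the median of the same values, chosen here
simply as a convenient second average for comparison.

```python
weights = [128, 140, 150, 190]          # pounds, from the example above

mean = sum(weights) / len(weights)
deviations = [w - mean for w in weights]
print(mean, deviations, sum(deviations))

# The same check about another average, the median of these four values.
ordered = sorted(weights)
median = (ordered[1] + ordered[2]) / 2
median_deviation_sum = sum(w - median for w in weights)
print(median, median_deviation_sum)
```

For these data the deviations about the median do not cancel, which is
consistent with the statement that the property belongs to the
arithmetic mean alone.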

Another  characteristic  of  the  arithmetic  mean  is  its  additive 
property.  The  meaning  of  this  property  may  be  made  clear  by 
finding  the  mean  of  two  sets  of  given  values.  Let  the  first  set  be 
115,  128,  140  and  the  second  be  150,  190. 

The arithmetic mean of the first set is

$$\frac{115 + 128 + 140}{3} = 127\tfrac{2}{3}$$

and of the second set is

$$\frac{150 + 190}{2} = 170.$$

The arithmetic mean of the composite of the two sets is

$$\frac{115 + 128 + 140 + 150 + 190}{5} = 144\tfrac{3}{5}.$$

But the weighted arithmetic mean of the two arithmetic means is

$$\frac{3\,(127\tfrac{2}{3}) + 2\,(170)}{3 + 2} = \frac{723}{5} = 144\tfrac{3}{5}.$$

This illustrates a theorem: The arithmetic mean of the sum of two
variables is the weighted arithmetic mean of their arithmetic means.
Symbolically: If $\bar{X}_1$ is the arithmetic mean of the first set having
$n_1$ values and $\bar{X}_2$ is the arithmetic mean of the second set having
$n_2$ values, and if $\bar{X}_{x_1 + x_2}$ is the weighted arithmetic mean of
the two arithmetic means, then

$$\bar{X}_{x_1 + x_2} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} = \bar{X}, \tag{II.14.2}$$

where $\bar{X}$ is the arithmetic mean of the $n_1 + n_2$ values. This may be
generalized to any number of variables.

In terms of expected values the theorem is stated as follows: The
expected value of the sum of two variables is the sum of their expected
values, that is:

$$E(X_1 + X_2) = E(X_1) + E(X_2). \tag{II.14.3}$$
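
The weighted-mean relation II. 14. 2. can be confirmed exactly for the
two sets used above. This Python sketch relies on exact fractions so
that 127 2/3 is not rounded; the helper name `mean` is an illustrative
assumption.

```python
from fractions import Fraction

first = [115, 128, 140]
second = [150, 190]

def mean(values):
    """Exact arithmetic mean as a fraction."""
    return Fraction(sum(values), len(values))

n1, n2 = len(first), len(second)
weighted = (n1 * mean(first) + n2 * mean(second)) / (n1 + n2)   # II. 14. 2.
combined = mean(first + second)       # mean of all n1 + n2 values

print(mean(first), mean(second), weighted, combined)
```

The weights are the set sizes; an unweighted average of the two means
would not reproduce the mean of the composite set.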

To illustrate another theorem, reconsider the set of values 115,
128, 140. If we multiply each value by 2, we have the values 230,
256, 280. The arithmetic mean of 115, 128, 140 each multiplied by
2 is

$$\frac{230 + 256 + 280}{3} = 2 \left[ \frac{115 + 128 + 140}{3} \right] = 2\,(127\tfrac{2}{3}).$$

The theorem is: The arithmetic mean of a constant times a variable
is equal to the constant times the arithmetic mean of the variable.

In terms of expected values the theorem is: The expected value of
a constant times a variable is equal to the product of the constant by the
expected value of the variable, that is:

$$E(cX) = cE(X). \tag{II.14.4}$$
Let us reconsider the arithmetic mean, namely:

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{n} = \frac{f_1}{n} X_1 + \frac{f_2}{n} X_2 + \cdots + \frac{f_k}{n} X_k$$

where $\sum_{i=1}^{k} \frac{f_i}{n} = 1$.

It is important to note that the coefficients of the $X_i$, namely,
the $f_i/n$, are the relative frequencies of occurrence of these values.
But from the definition of statistical probability (see Chapter III),
the limiting values of the $f_i/n$, as n becomes large beyond all bounds,
are the $p_i$, where $p_i$ is the probability of occurrence of a value $X_i$
of X among a set of mutually exclusive values $X_i$.
Symbolically:

$$E(X) = \lim_{n \to \infty} \bar{X} = \lim_{n \to \infty} \frac{\sum_{i} f_i X_i}{n} = \sum_{i} p_i X_i \tag{II.14.5}$$

where $p_i X_i$ is the expected value of a particular value $X_i$ of X and
$\sum_i p_i X_i$ is the sum of the expected values of the different
particular values $X_i$ of X. But the sum of expected values is the
expected value of the sum, and is called the mathematical expectation.
It is also known as the probable or expected value of the variable.

It also follows from II. 14. 5. that the arithmetic mean $\bar{X}$ of a
sample is an approximation to the probable or expected value,
namely, the true or universe value.

The arithmetic mean is most important in estimating and
predicting. The arithmetic mean $\bar{X}$ of a sample is the unbiased
estimator (a value whose expected value is the true value) of the true
mean of the population, the latter being E(X).

To illustrate: Suppose we have a considerable number of
observations of the speeds in miles per hour of vehicles passing a given point.
These may vary, say, from 19 miles per hour up to 70 miles per
hour. Suppose we wish to answer the question: At what speed in
miles per hour will a vehicle pass this point? The answer definitely
is the expected value if we have the "universe", or the arithmetic
mean if we have a random sample of the observed speeds. The
arithmetic mean is the only one of the averages for a set of
measurements that is an expected value. Furthermore, no quantity is of
any real value for predicting purposes unless it is a probable or
expected value or unless as determined from a sample it is an
optimum or unbiased estimator. An optimum estimator is one that
is consistent, efficient, and sufficient.

Another important theorem concerned with expected values is:
The expected value of the product of two mutually independent
variables is the product of their expected values. To illustrate:

Toss three pennies and throw three dice. The number of heads
occurring with the corresponding probabilities is shown in Table
II. 3. Likewise, the number of one spots occurring with the
corresponding probabilities is shown in Table II. 3.


Table II. 3

        Pennies                         Dice
No. of Heads   Probability    No. of One Spots   Probability
     X             p1                Y                p2

     0             1/8               0             125/216
     1             3/8               1              75/216
     2             3/8               2              15/216
     3             1/8               3               1/216

Table II. 4
Expected Values

     Pennies                  Dice
   X        p1 X           Y        p2 Y

   0         0             0          0
   1        3/8            1        75/216
   2        6/8            2        30/216
   3        3/8            3         3/216

 E(X) =     3/2          E(Y) =      1/2
In Table II. 4 is shown the expected number of times for the
different possibilities for number of heads occurring as well as the
expected number of heads. Also, there is shown the expected
number of times for the different possibilities for number of one
spots occurring as well as the expected number of one spots.

Table II. 5 lists for the compound event the expected number of
times for the different possibilities for number of heads and one
spots occurring as well as the expected number of heads and one
spots.

Table II. 5
Expected Values
Dice and Pennies

Heads    One Spots    Compound Probability     X Y p1 p2
  X          Y               p1 p2

  0          0             125/1728                0
  0          1              75/1728                0
  0          2              15/1728                0
  0          3               1/1728                0
  1          0             375/1728                0
  1          1             225/1728             225/1728
  1          2              45/1728              90/1728
  1          3               3/1728               9/1728
  2          0             375/1728                0
  2          1             225/1728             450/1728
  2          2              45/1728             180/1728
  2          3               3/1728              18/1728
  3          0             125/1728                0
  3          1              75/1728             225/1728
  3          2              15/1728              90/1728
  3          3               1/1728               9/1728

                           E(XY) = 1296/1728 = 3/4

From the above tables, it is seen that
$[E(X) = 3/2]\,[E(Y) = 1/2] = [E(XY) = 3/4]$, which symbolically is

$$E(XY) = E(X)\,E(Y). \tag{II.14.6}$$

In the case of two samples of data: The arithmetic mean of the
product of two mutually independent variables is the product of their
arithmetic means.

This theorem may be generalized to any number of mutually
independent variables.
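
The penny and dice illustration can be checked with exact fractions. The sketch below assumes fair pennies and fair dice; the helper name `binomial_pmf` is illustrative only and does not come from the text.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(n, p):
    """Return {k: probability of k successes in n independent trials}."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


heads = binomial_pmf(3, Fraction(1, 2))  # X: heads among three pennies
ones = binomial_pmf(3, Fraction(1, 6))   # Y: one spots among three dice

e_x = sum(x * p for x, p in heads.items())
e_y = sum(y * p for y, p in ones.items())
# For independent variables the compound probability of (x, y) is p1 * p2.
e_xy = sum(x * y * heads[x] * ones[y] for x, y in product(heads, ones))

print(e_x, e_y, e_xy)  # 3/2 1/2 3/4
```

The product of the first two printed values equals the third, as II.14.6 requires.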

II. 15. The Deviations from Any Arbitrary Value. The arithmetic
mean of all the deviations from any arbitrary number, added to
that number, is the arithmetic mean of the values. This theorem
may be explained by considering the weights of five persons who
weigh respectively 135, 175, 180, 185, 190. Suppose we select
$X_0 = 180$ as the arbitrary number, then

     X        x'' = X - X_0
    135           -45
    175            -5
    180             0
    185             5
    190            10
   n = 5          -35

and $\bar{X} = 180 - 35/5 = 173$.

This is a much shorter method than adding all the items and
dividing by their number.

Symbolically the theorem may be expressed as

$$\bar{X} = X_0 + \frac{\sum x''}{n}$$

where

$X_0$ = any arbitrary value, but usually a guessed mean, meaning
that it is as near the actual mean as can be estimated.

$x''$ = deviation of each value from $X_0$, the estimated mean.

$n$ = number of cases (individual values).
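
A minimal sketch of this shortcut, using the five weights above; the function name `mean_from_assumed` is an illustrative choice, not the authors' notation.

```python
def mean_from_assumed(values, x0):
    """Arithmetic mean computed as x0 plus the mean deviation from x0."""
    deviations = [x - x0 for x in values]
    return x0 + sum(deviations) / len(values)


weights = [135, 175, 180, 185, 190]
print(mean_from_assumed(weights, 180))  # 173.0
```

Any other choice of the arbitrary value gives the same mean; only the size of the deviations changes.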

II. 16. Mean Values in General. A mean value in general may be
thought of as the centroid of a frequency diagram. Let $y = f(x)$
be continuous in the x-interval $(a, b)$.

Divide $(a, b)$ into n equal parts of length $\Delta x$ and let
$y_i$ $(i = 1, 2, \ldots, n)$ be the value taken by y in the ith
part. The arithmetic mean of the numbers $y_1, y_2, \ldots, y_n$,
that is

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_i + \cdots + y_n}{n} \tag{II.16.1}$$

will approach a definite limit as n tends to infinity. If the
numerator and denominator of II.16.1 are multiplied by $\Delta x$,
its form is changed to

$$\frac{y_1\,\Delta x + y_2\,\Delta x + \cdots + y_i\,\Delta x + \cdots + y_n\,\Delta x}{n\,\Delta x}. \tag{II.16.2}$$

Figure II. 8. Graphical Representation of the Mean Value
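But $n\,\Delta x = b - a$, and the area A under the curve between
the limits a and b is

$$A = \lim_{\Delta x \to 0,\; n \to \infty} (y_1\,\Delta x + y_2\,\Delta x + \cdots + y_n\,\Delta x) = \int_a^b dA = \int_a^b y\,dx.$$

Hence, the mean value $\bar{y}$ of y is

$$\bar{y} = \lim \frac{\sum_{i=1}^{n} y_i\,\Delta x}{n\,\Delta x} = \frac{\int_a^b y\,dx}{b - a}. \tag{II.16.3}$$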


Likewise, the mean value $\bar{X}$ of X is found by taking first
moments about the y-axis, namely:

$$A\bar{X} = \int_a^b x\,dA, \quad\text{whence}\quad \bar{X} = \frac{\int_a^b x\,y\,dx}{\int_a^b y\,dx}. \tag{II.16.4}$$

II.16.2 may be interpreted as the average weight of $n\,\Delta x$
objects having various weights, where $\Delta x$ objects have a
weight of $y_1$, $\Delta x$ have a weight of $y_2$, ....

II.16.3 may also be obtained by the use of moments as illustrated
in Figure II.8. Here $y_i\,\Delta x$ objects have, say, a distance
$x_i$. The moment of $y_i\,\Delta x$ about the y-axis is
$x_i y_i\,\Delta x$. The moment of the whole, if $\bar{x}$ is its
distance, is $\bar{x}(b - a)$ and also $\sum_{i=1}^{n} x_i y_i\,\Delta x$.

Hence:

$$\bar{x}(b - a) = \lim \sum_{i=1}^{n} x_i y_i\,\Delta x = \int_a^b x\,y\,dx,$$

whence:

$$\bar{x} = \lim_{\Delta x \to 0} \frac{\sum_{i=1}^{n} x_i y_i\,\Delta x}{b - a} = \frac{\int_a^b x\,y\,dx}{b - a}.$$

The  notion  of  mean  is  readily  extended  to  functions  of  two  or 
more  variables.  To  see  this  generalization,  the  reader  is  referred  to 
any  book  on  Calculus  or  Mechanics. 
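
The limits in II.16.3 and II.16.4 can be approximated numerically by summing over many narrow strips. The sketch below is an illustration under assumed inputs: the curve $y = x^2$ on $(0, 3)$ and the function name `mean_value_and_centroid` are examples chosen here, not taken from the text.

```python
def mean_value_and_centroid(f, a, b, n=100_000):
    """Approximate the mean ordinate (II.16.3) and centroid abscissa (II.16.4)."""
    dx = (b - a) / n
    area = 0.0
    moment = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx  # midpoint of the i-th strip
        y = f(x)
        area += y * dx          # sum of y_i * dx
        moment += x * y * dx    # first moment about the y-axis
    return area / (b - a), moment / area


y_bar, x_bar = mean_value_and_centroid(lambda x: x * x, 0.0, 3.0)
print(round(y_bar, 4), round(x_bar, 4))  # 3.0 2.25
```

For this curve the exact values are $\bar{y} = 9/3 = 3$ and $\bar{X} = (81/4)/9 = 2.25$, which the strip sums reproduce closely.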

II. 17. The Mode. The mode or modal value of a variable is that
value of the variable which occurs most frequently, if such a value
exists. It is the most probable value, or in other words, the value
for which the frequency is a maximum. The expression most probable
value, when it refers to the number of successes in n trials, is
used in the general theory of probability to designate the number
to which there corresponds a larger probability of occurrence than
to any other number. The point at which the frequency is most
dense is the abscissa of the maximum point of the frequency curve
and can be determined accurately only from the equation of the
curve.

For a given grouping the class mark of the maximal class
frequency is called the empirical mode.

An approximation to the mode may be obtained by passing a
parabola through the midpoints of the upper bases of the modal
class and the two adjacent classes. Figure II.9 shows three such
points h, i, j.

The general equation of a parabola with its axis parallel to the
y-axis is

$$y = \alpha + \beta x + \gamma x^2. \tag{II.17.1}$$

In Figure II.9, take the origin at the point O, namely, at the
lower limit of the modal class. Let c equal the class interval and
$\Delta_1 = OG$ and $\Delta_2 = ED$. When $x = -c/2$, $y = 0$; when
$x = c/2$, $y = \Delta_1$; when $x = 3c/2$, $y = \Delta_1 - \Delta_2$.
Substitute these values for x and y in II.17.1 and

$$\begin{aligned}
0 &= \alpha - \beta\,(c/2) + \gamma\,(c^2/4) \\
\Delta_1 &= \alpha + \beta\,(c/2) + \gamma\,(c^2/4) \\
\Delta_1 - \Delta_2 &= \alpha + \beta\,(3c/2) + \gamma\,(9c^2/4)
\end{aligned} \tag{II.17.2}$$

Solving these equations for $\alpha$, $\beta$, $\gamma$,

$$\alpha = \frac{5\Delta_1 + \Delta_2}{8}; \qquad \beta = \frac{\Delta_1}{c}; \qquad \gamma = -\frac{\Delta_1 + \Delta_2}{2c^2}. \tag{II.17.3}$$

The maximum point on the curve $y = \alpha + \beta x + \gamma x^2$
is found by setting

$$dy/dx = \beta + 2\gamma x = 0, \qquad d^2y/dx^2 = 2\gamma < 0. \tag{II.17.4}$$

From II.17.4,

$$x = -\beta/2\gamma, \qquad \gamma < 0. \tag{II.17.5}$$

Substituting the values for $\beta$ and $\gamma$ from II.17.3 in II.17.5,

$$x = \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) c. \tag{II.17.6}$$

The quantity found for x in II.17.6 when added to the lower
limit of the modal class is the approximate value of the mode,
namely

$$\text{Mode} = l_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) c \tag{II.17.7}$$

where

$l_1$ = lower limit of the class with maximum frequency.
$\Delta_1 = f_0 - f_l$ (See Figure II.9.)
$\Delta_2 = f_0 - f_r$ (See Figure II.9.)

In Table II.1, $f_0 = 74$, $f_l = 60$, $f_r = 29$. Substituting these
values in II.17.7, we obtain

$$\text{Mode} = 39.5 + \left(\frac{14}{14 + 45}\right) 5 = 40.7. \tag{II.17.8}$$
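
The parabolic approximation reduces to a one-line computation. In the sketch below the function name `parabolic_mode` and its argument names are illustrative; the numbers are those quoted above for the modal class and its two neighbors.

```python
def parabolic_mode(lower_limit, f_left, f_modal, f_right, width):
    """Mode from II.17.7: lower limit plus the fraction d1 / (d1 + d2) of the class width."""
    d1 = f_modal - f_left   # excess of the modal class over the class below it
    d2 = f_modal - f_right  # excess of the modal class over the class above it
    return lower_limit + d1 / (d1 + d2) * width


mode = parabolic_mode(39.5, 60, 74, 29, 5)
print(round(mode, 1))  # 40.7
```

When the two neighboring classes have equal frequencies the fraction is one half, and the estimate falls at the class mark of the modal class.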

The graphical counterpart of the solution just given for finding
the mode is as follows. Consider the distribution given in Table II.1.
From this table select the modal class and the class adjacent to it
on either side, and for these three classes plot the three frequency
rectangles on graph paper as illustrated in Figure II.9.

Figure II. 9. Graphical Solution for Finding the Modal Value of a
Set of Observations (horizontal axis: speed in miles per hour, with
class limits 34.5, 39.5, 44.5, 49.5)

Connect the points G and E with a straight line and the points O
and D with a straight line. Then from the point of intersection of
these two lines drop a perpendicular to the horizontal axis. The
number read on the horizontal scale at the point where this
perpendicular cuts the horizontal scale is the graphical solution of
the mode. In this case it is 40.8. Comparing the value of the mode
found graphically with the value of the mode just found
arithmetically, it is seen that the difference is 0.1, which is
negligible.

It is not difficult to show that the abscissa of the point of
intersection of the lines joining OD and GE is

$$\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) c$$

which proves that the graphical solution given is theoretically the
same as the analytical.

Since the mode can be read from the graph with only slight error,
the graphical solution will suffice for most practical purposes.
This result means that the most probable speed of a vehicle at the
point observed is 40.7 miles per hour. In other words, more vehicles
pass this point at a speed of 40.7 miles per hour than at any other
speed.

II.  18.  Median.  The  median  of  a  variable  is  a  number  which  is  such 
that  half  of  the  measurements  have  a  value  less  than  it  and  the 
other  half  have  a  value  greater  than  it.  It  is  thus  the  abscissa  of  the 
point  the  vertical  through  which  divides  the  total  area  under  the 
frequency  curve  or  frequency  rectangles  into  two  equal  parts.  To 
compute  the  median  of  a  sample  set  of  n  values  of  the  variable, 
compute  the  abscissa  of  a  point,  the  vertical  through  which  divides 
the  total  area  of  the  frequency  rectangles  into  two  equal  parts. 

Illustration:

From columns (1) and (6) in Table II.1, and from Figure II.10,
it is seen that the sum of the frequencies (sum of the areas) of the
classes up to X = 34.5 is 106 and the sum of the frequencies (sum
of the areas) of the classes up to X = 39.5 is 166. But one-half the
total frequency is 150, which is between 106 and 166. Hence the
median value, by definition, lies between X = 34.5 and X = 39.5
at a point which is the same proportion of the distance from
X = 34.5 to X = 39.5 as 150 is from 106 to 166.

Figure II. 10. Median Value of Observed Vehicle Speeds (horizontal
axis: speed in miles per hour)
Symbolically it is seen that

$$\text{Median} = l_1 + \left(\frac{n/2 - f_{cl_1}}{f_m}\right) c \tag{II.18.1}$$

where

$l_1$ = lower bound of class in which median value falls.

$n$ = total frequency.

$f_{cl_1}$ = cumulative frequency to lower limit of class in which
median value lies.

$f_m$ = frequency of class in which median lies.

$c$ = length of class interval.

Hence for the given distribution

$$\text{Median} = 34.5 + \left(\frac{150 - 106}{60}\right) 5 = 38.2. \tag{II.18.2}$$
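
The interpolation in II.18.1 can be sketched as a short routine. The class frequencies below are those listed for this distribution in Tables II.6 and II.7, arranged from the 15-19 class upward; the function name `grouped_median` and the assumption that the classes are contiguous and of equal width are illustrative.

```python
def grouped_median(first_lower, width, freqs):
    """Median of grouped data by linear interpolation inside the median class."""
    n = sum(freqs)
    half = n / 2
    cumulative = 0  # cumulative frequency up to the lower limit of the current class
    for i, f in enumerate(freqs):
        if f > 0 and cumulative + f >= half:
            lower = first_lower + i * width
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("no observations")


# Frequencies for the classes 15-19, 20-24, ..., 70-74 miles per hour.
freqs = [8, 6, 29, 63, 60, 74, 29, 14, 15, 2, 0, 0]
median = grouped_median(14.5, 5, freqs)
print(round(median, 1))  # 38.2
```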

II.  19.  Quantiles:  Quantiles  are  location  and  division  numbers. 
They,  like  the  median,  divide  the  distribution  into  sections.  There 
are  many  quantiles,  but  we  shall  mention  and  briefly  discuss  only 
those  frequently  used.  There  are  the  quartiles  (quarters),  quintiles 
(fifths),  deciles  (tenths),  and  percentiles  (hundredths).  The  method  of 
finding  them  is  similar  to  that  of  finding  the  median. 

A quantile value (or percentile) is a number such that the
specified quantile (percentage) proportion of cases have a measure
less than it and the remainder have a measure greater than it.
Symbolically,

$$\text{Quantile} = l_1 + \left(\frac{kn - f_{cl_1}}{f_q}\right) c \tag{II.19.1}$$

where

$l_1$ = lower bound of class in which quantile value falls.

$k$ = proportion of cases below specified quantile value.

$n$ = total frequency.

$f_{cl_1}$ = cumulative frequency to lower limit of class in which
quantile value lies.

$f_q$ = frequency of class in which the specified quantile value
lies.

To illustrate: It is desired to find the lower quartile $Q_1$ or the
25th percentile and the upper quartile $Q_3$ or the 75th percentile.

In the former case, $k = 1/4$, and from columns (1) and (6) of Table
II.1, it is seen that $f_{cl_1} = 43$ and $f_q = 63$ and $l_1 = 29.5$.
Hence II.19.1 becomes

$$Q_1 = 29.5 + \left(\frac{\tfrac{1}{4}(300) - 43}{63}\right) 5 = 32.0. \tag{II.19.2}$$

In the latter case, $k = 3/4$, it is seen that $f_{cl_1} = 166$ and
$f_q = 74$ and $l_1 = 39.5$. Here II.19.1 becomes

$$Q_3 = 39.5 + \left(\frac{\tfrac{3}{4}(300) - 166}{74}\right) 5 = 43.5. \tag{II.19.3}$$

These two values mean that 25 per cent of the vehicles at the
observed point had a speed less than 32.0 miles per hour and 25 per
cent of the vehicles had a speed greater than 43.5 miles per hour.

If  it  is  desired  to  know  the  4th  decile,  then  k  =  0.4  in  II.  19.1. 
and  if  it  is  desired  to  know  the  thirty-second  percentile,  then 
k  =  0.32.  In  other  words  the  4th  decile  means  a  speed  such  that 
0.4  of  the  vehicles  have  a  lower  speed  and  0.6  a  higher  speed  and 
the  thirty-second  percentile  means  a  speed  such  that  32  per  cent 
have  a  lower  speed  and  68  per  cent  a  greater  speed. 
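
Formula II.19.1 covers every quantile once k is chosen. The following sketch assumes contiguous classes of equal width and uses the class frequencies listed for this distribution in Tables II.6 and II.7; the function name `grouped_quantile` is illustrative.

```python
def grouped_quantile(k, first_lower, width, freqs):
    """Value below which a proportion k of the grouped observations fall (II.19.1)."""
    target = k * sum(freqs)
    cumulative = 0  # cumulative frequency up to the lower limit of the current class
    for i, f in enumerate(freqs):
        if f > 0 and cumulative + f >= target:
            lower = first_lower + i * width
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("k must lie between 0 and 1")


# Frequencies for the classes 15-19, 20-24, ..., 70-74 miles per hour.
freqs = [8, 6, 29, 63, 60, 74, 29, 14, 15, 2, 0, 0]
q1 = grouped_quantile(0.25, 14.5, 5, freqs)
q3 = grouped_quantile(0.75, 14.5, 5, freqs)
print(round(q1, 1), round(q3, 1))  # 32.0 43.5
```

Calling the same function with k = 0.4 or k = 0.32 gives the 4th decile and the thirty-second percentile discussed above.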

Having found the values of the arithmetic mean, the median
and the mode, what are the differences in their values and
meanings? For the single-humped, moderately skewed distributions
usually met in practice, the median lies between the arithmetic
mean and the mode, so that either

$$\bar{X} \le \text{Median} \le \text{Mode} \quad\text{or}\quad \text{Mode} \le \text{Median} \le \bar{X}. \tag{II.19.4}$$

For the distribution of Table II.1, it was found that $\bar{X} = 38.2$,
the Median = 38.2, the Mode = 40.7 miles per hour. The apparent
equality of the median and arithmetic mean in this sample is due
primarily to grouping and sampling errors and to some extent due
to experimental error. The modal value of 40.7 reveals that a
greater proportion of the vehicles at the point observed travel at
a speed greater than the probable or expected speed of 38.2 miles
per hour. This observed tendency is important and can and must
be explained from a subjective study. The other results show that
25 per cent of vehicles travelled with a speed less than 32.0 miles
per hour, 25 per cent with a speed greater than 43.5 miles per
hour, and 50 per cent with a speed of from 32.0 to 43.5 miles per
hour. The lower 25 per cent had a range in speed of 32.0 - 14.5
= 17.5 miles per hour, the middle 50 per cent had a range of
43.5 - 32.0 = 11.5 miles per hour, and the upper 25 per cent had
a range in speed of 74.5 - 43.5 = 31.0 miles per hour. Similarly,
the second 25 per cent had a range in speed of 38.2 - 32.0 = 6.2
miles per hour and the third 25 per cent a range of 43.5 - 38.2 =
5.3 miles per hour. These results indicate rather plainly a lack of
stability and uniformity in speeds due to drivers, type of vehicles,
and topography at the point observed.

II. 20. Geometric Mean. The geometric mean of a set of n positive
measurements is the nth root of their product. If $X_i$
$(i = 1, 2, \ldots, n)$ are the n values for a variable X, the
geometric mean,

$$\text{G.M.} = \left(\prod_{i=1}^{n} X_i\right)^{1/n} = (X_1 \cdot X_2 \cdots X_n)^{1/n} \tag{II.20.1}$$

where $\Pi$ is the symbol for the product.
For a frequency distribution,

$$\text{G.M.} = \left(X_1^{f_1} \cdot X_2^{f_2} \cdots X_i^{f_i} \cdots X_k^{f_k}\right)^{1/n} \tag{II.20.2}$$

where $\sum_{i=1}^{k} f_i = n$. It is significant that the

$$\log \text{G.M.} = \frac{f_1 \log X_1 + f_2 \log X_2 + \cdots + f_k \log X_k}{n} = \frac{\sum_{i=1}^{k} f_i \log X_i}{n}. \tag{II.20.3}$$

This means that the logarithm of the geometric mean is the
arithmetic mean of the logarithms of the measurements. Recalling the
relationship between relative frequency and probability, it is
evident that as the number of measurements is indefinitely
increased the logarithm of the geometric mean becomes the probable
or expected value of the logarithm of the variable X.

For analyzing a frequency distribution, the geometric mean has
no immediate value. The geometric mean is the proper average of a
set of rates, or of a set of things that behave like rates, and it
is the only average with this property. Two examples will
illustrate this property:

(1) A city had a population in 1900 of 100,000 and in 1910 of
120,000. What is the average annual rate of increase in population?
This problem is analogous to a problem in compound interest
where the amount, principal, and time are known and the rate of
interest is to be found. Hence

$$P_n = P_0 (1 + r)^n \tag{II.20.4}$$

where

$P_n$ = the population at the end of n years.

$P_0$ = the population at the beginning of the period.

$n$ = number of time intervals.

Substitute the above values in II.20.4, then

$$120{,}000 = 100{,}000\,(1 + r)^{10}.$$

Solving for r, it is found that

r = .0184 = 1.84% change per annum.
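
Both II.20.3 and the population example lend themselves to a short check. The function names `annual_rate` and `geometric_mean` below are illustrative; the second computes the geometric mean through logarithms, exactly as II.20.3 describes.

```python
from math import exp, log


def annual_rate(p0, pn, years):
    """Solve pn = p0 * (1 + r) ** years for r (II.20.4)."""
    return (pn / p0) ** (1 / years) - 1


def geometric_mean(values, freqs):
    """Geometric mean of a frequency distribution via II.20.3."""
    n = sum(freqs)
    return exp(sum(f * log(x) for x, f in zip(values, freqs)) / n)


r = annual_rate(100_000, 120_000, 10)
print(round(r, 4))  # 0.0184
```

The quantity 1 + r is itself the geometric mean of the ten yearly growth factors, which is why a rate problem of this kind leads to the geometric mean.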

(2) Given the information shown in tabular form:

Community   Native Born    Foreign Born   Ratio of Foreign Born   Ratio of Native Born
            Inhabitants    Inhabitants    to Native Born          to Foreign Born
    A        a = 9000       c = 4500       c/a = 50%               a/c = 200%
    B        b = 2000       d = 4000       d/b = 200%              b/d = 50%

It may be shown that the arithmetic mean is not the average
rate.

The arithmetic mean of the ratios of Foreign born to Native
born is

$$\frac{50\% + 200\%}{2} = 125\% = \frac{c/a + d/b}{2} = \frac{cb + ad}{2ab}.$$

The arithmetic mean of the ratios of Native born to Foreign
born is

$$\frac{200\% + 50\%}{2} = 125\% = \frac{a/c + b/d}{2} = \frac{ad + bc}{2cd}.$$

Since the product of these two results is not unity or 100%, they
are illogical and the arithmetic mean is not the proper average to
use.

The geometric mean of the ratios of Foreign born to Native
born is

$$\text{G.M.} = \sqrt{.50 \times 2.00} = 1.00 = 100\% = \sqrt{(c/a)(d/b)} = \sqrt{cd/ab}.$$

The geometric mean of the ratios of Native born to Foreign
born is

$$\text{G.M.} = \sqrt{2.00 \times .50} = 1.00 = 100\% = \sqrt{(a/c)(b/d)} = \sqrt{ab/cd}.$$

The product of these two results is unity or 100%.

Now

$$\frac{c + d}{a + b} = \frac{4500 + 4000}{9000 + 2000} = \frac{8500}{11000} = .7727 = 77.27\%$$

and

$$\frac{a + b}{c + d} = \frac{9000 + 2000}{4500 + 4000} = \frac{11000}{8500} = 1.2941 = 129.41\%.$$

But

$$\frac{c + d}{a + b} \cdot \frac{a + b}{c + d} = 1$$

and .7727 times 1.2941 = 1.

Since the product of the ratios must be unity, it is seen that the
geometric mean is the average rate.

II. 21. Harmonic Mean. The harmonic mean of a set of measures
is the reciprocal of the arithmetic mean of the reciprocals of the
measures.

Symbolically, if H.M. is the harmonic mean,

$$\text{H.M.} = \frac{n}{f_1/x_1 + f_2/x_2 + \cdots + f_k/x_k}. \tag{II.21.1}$$

To illustrate: Suppose we have a vehicle that travels 25 miles
per hour for 20 miles, then 30 miles per hour for 10 miles, then
50 miles per hour for 50 miles, then 40 miles per hour for 10 miles
and finally, 12 miles per hour for 10 miles. What is the average
speed of this vehicle for the 100 miles travelled? It is the harmonic
mean, namely,

$$\text{H.M.} = \frac{100}{20\,(1/25) + 10\,(1/30) + 50\,(1/50) + 10\,(1/40) + 10\,(1/12)} = 31.1 \text{ miles per hour.}$$

This average speed may be found by an arithmetic mean method
if weights are properly chosen. If X' is the symbol for the average
speed for an arithmetic mean method,

$$\begin{aligned}
X' &= \frac{25\{(.04)(20)\} + 30\{(.033)(10)\} + 50\{(.02)(50)\} + 40\{(.025)(10)\} + 12\{(.083)(10)\}}{3.216} \\
   &= \frac{25\,(.8) + 30\,(.333) + 50\,(1) + 40\,(.25) + 12\,(.833)}{3.216} \\
   &= \frac{20.000 + 9.999 + 50.000 + 10.000 + 9.996}{3.216} = 31.1 \text{ miles per hour,}
\end{aligned}$$

where 0.8, 0.333, 1, 0.25, and 0.833 are the weights, that is, the
hours spent at each speed, and 3.216 is their sum.



The  latter  method,  while  it  solves  the  problem,  is  not  as  direct  and 
simple  as  the  harmonic  mean.  Of  all  the  averages,  the  harmonic 
mean  is  the  only  one  that  is  the  average  time  rate  or  the  average 
of  things  that  behave  like  time  rates. 
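
The two routes to the average speed can be compared directly. In this sketch the list `legs` restates the trip described above as (speed, distance) pairs; the variable names are illustrative.

```python
# Each leg of the trip: (speed in miles per hour, distance in miles).
legs = [(25, 20), (30, 10), (50, 50), (40, 10), (12, 10)]

total_miles = sum(miles for _, miles in legs)
hours = [miles / speed for speed, miles in legs]  # time spent on each leg

# Harmonic mean of the speeds, weighted by distance (II.21.1).
harmonic = total_miles / sum(hours)

# Arithmetic mean of the speeds, weighted by the time spent at each speed.
weighted_arithmetic = sum(speed * h for (speed, _), h in zip(legs, hours)) / sum(hours)

print(round(harmonic, 1), round(weighted_arithmetic, 1))  # 31.1 31.1
```

Both figures are simply total distance divided by total time, which is why the time-weighted arithmetic mean and the distance-weighted harmonic mean agree.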

II. 22. Root Mean Square. The root mean square (R.M.S.), $\sigma$,
often called the standard deviation in statistics, is similar to the
radius of gyration k in mechanics. The radius of gyration of the
area under a frequency curve about the ordinate through the center
of gravity of that area is, in fact, equal to $\sigma$.

The physical meaning of radius of gyration is that it is a distance
such that if all the mass of a body (or area) were concentrated at a
point that distance from an axis of rotation it would have the
same rotational effect as the actual distributed mass (area). It is
also the root mean square of the radial distances of a set of n equal
particles from an axis. In the same way, $\sigma$, the standard
deviation of a frequency distribution (area) thought of as a set of
n equal particles of area, is the square root of the arithmetic mean
of the squares of the radial distances of the several particles from
the centroidal axis; that is, it is the R.M.S. as well as k with
respect to the centroidal axis.

It is believed that a review of the significance of second moments
and the radius of gyration k in mechanics will help the reader to
understand the corresponding terms in statistics.

Let  A  be  any  area  and  YY  an  axis  through  the  centroid  0  as 
shown  in  Figure  II.  11. 

Let  dA  represent  an  element  of  area  and  let  x  be  its  distance 
from  the  centroidal  axis  YY. 

The moment of inertia $I_Y$ is by definition the sum of all the
$x^2\,dA$, that is,

$$I_Y = \int_A x^2\,dA \tag{II.22.1}$$

and the radius of gyration k is given by

$$k^2 = \frac{I_Y}{A}. \tag{II.22.2}$$

If the moment of inertia of an area with respect to a centroidal
axis is known, the moment of inertia with respect to a parallel axis
may be found as follows:

In Figure II.11, let Y'Y' be any axis parallel to YY and at a
distance d from YY.

Figure II. 11. Moment of Inertia of an Area with Respect to a
Parallel Axis

The moment of inertia of the element dA about Y'Y' is equal to
$(x + d)^2\,dA$, and $I_{Y'}$ for the total area is

$$\begin{aligned}
I_{Y'} &= \int_A (x + d)^2\,dA \\
       &= \int_A x^2\,dA + 2d \int_A x\,dA + d^2 \int_A dA \\
       &= I_Y + A d^2
\end{aligned} \tag{II.22.3}$$

since $\int_A x\,dA = A\bar{x} = 0$.

The fact that $\int_A x\,dA = 0$ may be comprehended if it is
remembered that for every element dA on the right, there is an
element (dA)' at a distance x' to the left, such that x' (dA)' = x dA.
In other words, we may think of the area as being balanced about
the centroidal axis.
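
The same parallel-axis relation holds for a set of n equal particles, with n taking the place of the area A. The sketch below checks it on five illustrative values; the function name `second_moment` and the choice of the axis at 180 are assumptions made for the example.

```python
def second_moment(points, axis):
    """Sum of squared distances of equal unit particles from a given axis."""
    return sum((x - axis) ** 2 for x in points)


points = [135, 175, 180, 185, 190]
centroid = sum(points) / len(points)  # 173.0
d = 180 - centroid                    # distance between the two parallel axes

about_centroid = second_moment(points, centroid)
about_parallel = second_moment(points, 180)

# II.22.3 with the number of particles n playing the role of the area A.
print(about_parallel, about_centroid + len(points) * d**2)  # 2175 2175.0
```

Because the term $n d^2$ is never negative, the second moment is smallest about the centroidal axis.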

The frequency diagram in statistics may be treated in the
same manner as an area is treated in mechanics. The notation is
slightly different and so is the point of view and interpretation, as
is shown in Figure II.12. Otherwise, the procedure is the same.

Figure II. 12. Frequency Diagram

Using the notation shown in Figure II.12,

$$\sigma^2 = k^2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2. \tag{II.22.4}$$

This may be written in the form

$$2\sigma^2 = 2k^2 = \frac{1}{n^2} \sum_i \sum_j (x_i - x_j)^2. \tag{II.22.5}$$


We  thus  see  that  the  standard  deviation  is  (1)  the  square  root 
of  the  arithmetic  mean  of  the  squares  of  the  differences  between 
the  measurements  and  their  arithmetic  mean  and  (2)  proportional 

Table II.6.

SPEED IN MILES PER HOUR OF FREE MOVING VEHICLES ON SEPTEMBER 11,
1939, IN OAKLAWN, ILLINOIS ON U.S.H. 12 AND 20 AT A POINT ONE MILE EAST
OF HARLEM AVE.

Speed in miles    Number of
per hour          Vehicles
     S               f          s         fs        fs^2
   70-74             0          6          0          0
   65-69             0          5          0          0
   60-64             2          4          8         32
   55-59            15          3         45        135
   50-54            14          2         28         56
   45-49            29          1         29         29
   40-44            74          0          0          0
   35-39            60         -1        -60         60
   30-34            63         -2       -126        252
   25-29            29         -3        -87        261
   20-24             6         -4        -24         96
   15-19             8         -5        -40        200
                   300                  -227       1121

Substitute the indicated values from Table II.6 in II.22.10;
then

$$\sigma = 5 \sqrt{\frac{1121}{300} - \left(\frac{-227}{300}\right)^2} = 5 \sqrt{3.7367 - 0.5726} = 5\,(1.779) = 8.9 \text{ miles per hour.}$$
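
The coded computation can be reproduced from the columns s and f of Table II.6. In this sketch the function name `coded_mean_and_sd` is illustrative, and the origin of 42.0 miles per hour is an assumption: it is the midpoint of the 40-44 class, for which s = 0.

```python
from math import sqrt


def coded_mean_and_sd(origin, width, coded):
    """Mean and standard deviation from (s, f) pairs of coded class deviations."""
    n = sum(f for _, f in coded)
    sum_fs = sum(f * s for s, f in coded)
    sum_fs2 = sum(f * s * s for s, f in coded)
    mean = origin + width * sum_fs / n
    sd = width * sqrt(sum_fs2 / n - (sum_fs / n) ** 2)
    return mean, sd


# (s, f) pairs read from Table II.6, from the 70-74 class down to the 15-19 class.
coded = [(6, 0), (5, 0), (4, 2), (3, 15), (2, 14), (1, 29), (0, 74),
         (-1, 60), (-2, 63), (-3, 29), (-4, 6), (-5, 8)]

mean, sd = coded_mean_and_sd(42.0, 5.0, coded)
print(round(mean, 1), round(sd, 1))  # 38.2 8.9
```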

This  means  that  we  would  expect  the  speed  of  a  random  vehicle 
to  be  somewhere  between  38.2  —  8.9  and  38.2  +  8.9  miles  per  hour, 
namely,  between  29.3  and  47.1  miles  per  hour. 

From  an  examination  of  the  distribution  of  speeds,  we  find  that 
approximately  71  per  cent  of  the  vehicles  had  a  speed  between 
29.3  and  47.1  miles  per  hour.  Hence  this  relative  frequency  tells 
us  that  we  are  approximately  71  per  cent  certain  that  a  random 
vehicle  will  pass  the  intersection  with  a  speed  between  29.3  and 
47.1  miles  per  hour. 
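
The figure of approximately 71 per cent can be recovered from the grouped data by interpolating within the two classes that contain the limits. The sketch assumes that observations are spread evenly within each class; the function name `count_below` is illustrative, and the frequencies are those of Table II.6 arranged from the 15-19 class upward.

```python
def count_below(x, first_lower, width, freqs):
    """Number of grouped observations below x, interpolating linearly within a class."""
    total = 0.0
    for i, f in enumerate(freqs):
        lower = first_lower + i * width
        if x >= lower + width:
            total += f                           # whole class lies below x
        elif x > lower:
            total += f * (x - lower) / width     # only part of the class lies below x
    return total


# Frequencies for the classes 15-19, 20-24, ..., 70-74 miles per hour.
freqs = [8, 6, 29, 63, 60, 74, 29, 14, 15, 2, 0, 0]
inside = count_below(47.1, 14.5, 5.0, freqs) - count_below(29.3, 14.5, 5.0, freqs)
share = inside / sum(freqs)
print(round(100 * share))  # 71
```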

If, on the other hand, we use the expected speed of 38.2 miles
per hour as our estimate, it is 71 per cent certain that we will be
in error by at most $\sigma/\bar{X} = 8.9/38.2 = 23.3$ per cent.
Conversely, it is 29 per cent certain that the error is at least 23.3
per cent.

This  indicates  that  there  is  marked  variability  in  speeds  and 
there  does  not  appear  to  be  a  typical  speed  at  all  for  this  point  on 
the  highway. 

II. 23. Contra Harmonic Mean. The contra harmonic mean is a
measure of relative dispersion. It is the arithmetic mean of the
squares of the measures from an arbitrary origin divided by the
arithmetic mean of the measures. Symbolically, if C.H.M. is the
contra harmonic mean, then

$$\text{C.H.M.} = \sum_i x_i^2 \Big/ \sum_i x_i. \tag{II.23.1}$$

The contra harmonic mean per se is of very little use today.
However, a quantity similar to it, namely the coefficient of
variability, is useful as a measure of relative dispersion or a
measure of per cent of error. If C.V. is the symbol for coefficient
of variability, then, by definition

$$\text{C.V.} = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{n}} \Bigg/ \frac{\sum_i X_i}{n} = \frac{\sigma}{\bar{X}}. \tag{II.23.2}$$

In II.22 the C.V. was interpreted for the distribution given in
Table II.1.

II.  24.  Mean  or  Average  Deviation.  The  mean  or  average  deviation 
from  an  average  is  the  A.M.  of  the  deviations  treating  them  all  as 
positive.  The  deviations  may  be  taken  from  any  average,  but  the 
mean  deviation  is  least  when  the  median  is  the  origin. 

In case of a normal distribution with origin at the arithmetic
mean or median, the mean deviation is the abscissa of the centroid
of area under the right hand half of the frequency curve and its
value is $0.7978\,\sigma = 0.8\,\sigma$ approximately. Assume the
frequency for each class concentrated at the center of the class as
shown in Figure II.13. Let the distances of these centers from the
center of the class containing the median be $d_1, d_2, \ldots$, and
let the corresponding class frequencies be $f_1, f_2, \ldots$, so that
the sum of moments about the median is
$f_1 d_1 + f_2 d_2 + \cdots + f_n d_n$.

Figure II. 13. Mean or Average Deviation of a Set of Observations

Ignore the class containing the median for the present. Let C be
the distance, in class intervals, of the median above the center of
the median class. All the products whose deviations lie below (to the
left of) the median have deviations too short by an amount C and
those above (to the right) are too long by an amount C. Next
consider the sum of the deviations below the median class and above
the median class. If $N_a$ is the number of observations above and
$N_b$ the number below the median class, then we have as a first
correction

$$(N_b - N_a)\,C. \tag{II.24.1}$$

If $N_m$ is the number of observations in the median class and if we
assume these $N_m$ observations uniformly distributed over the
interval, then $(.5 + C)\,N_m$ cases are below and $(.5 - C)\,N_m$ are
above the median. With a uniform distribution, the sum of these
deviations below the median is

$$\frac{(.5 + C)^2 N_m}{2}$$

and above the median

$$\frac{(.5 - C)^2 N_m}{2}.$$

Hence the sum of all the deviations of the $N_m$ values is

$$\frac{(.5 + C)^2 N_m}{2} + \frac{(.5 - C)^2 N_m}{2} = (.25 + C^2)\,N_m \tag{II.24.2}$$

which is the second correction.

Let  us  now  find  the  mean  deviation  from  the  median  for  the 
distribution  given  in  Table  II.  1. 

Table II.7.

SPEED IN MILES PER HOUR OF FREE MOVING VEHICLES ON SEPTEMBER 16,
1939, IN OAKLAWN, ILLINOIS, ON U.S.H. 12 AND 20 AT A POINT ONE MILE EAST
OF HARLEM AVE.

   X = S         f         x = s       f|s|*
   70-74         0           7           0
   65-69         0           6           0
   60-64         2           5          10
   55-59        15           4          60
   50-54        14           3          42
   45-49        29           2          58
   40-44        74           1          74
   35-39        60           0           0
   30-34        63          -1          63
   25-29        29          -2          58
   20-24         6          -3          18
   15-19         8          -4          32
               300 = n                 415

* The symbol |s| means the numerical value of s, which is always positive or zero.
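
The median, 38.2, lies 1.2 miles per hour above the center, 37.0,
of the median class. Since the corrections are expressed in class
intervals, C = 1.2/5 = 0.24.

Correction (1): $(N_b - N_a)\,C$ = (106 - 134) (0.24) = -6.7

Correction (2): $(.25 + C^2)\,N_m$ = (.25 + 0.0576) (60) = 18.5

Sum of deviations for classes other than median class = 415.0

Sum of all deviations = 426.8

Mean Deviation = 426.8/300 = 1.42 class intervals
= 7.1 miles per hour.

This means that the expected value of the difference, taken
without regard to sign, between the speed of a vehicle and the
median value of speeds is 7.1 miles per hour. This agrees with the
normal-curve value $0.8\,\sigma = 0.8\,(8.9) = 7.1$ miles per hour.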
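
The two corrections can be combined in a short routine. This is a sketch under stated assumptions: C is measured in class intervals, as the factor (.5 + C) in II.24.2 requires, so that C = (38.2 - 37.0)/5 = 0.24; the function name `mean_deviation_from_median` is illustrative; and the frequencies are those of Table II.7 arranged from the 15-19 class upward.

```python
def mean_deviation_from_median(width, freqs, median_index, c):
    """Mean deviation about the median using corrections II.24.1 and II.24.2.

    c is the distance of the median above the center of the median class,
    measured in class intervals.
    """
    n_below = sum(freqs[:median_index])
    n_above = sum(freqs[median_index + 1:])
    n_m = freqs[median_index]
    # Sum of f|s| with deviations counted in class intervals from the median class.
    base = sum(f * abs(i - median_index) for i, f in enumerate(freqs))
    first = (n_below - n_above) * c   # correction II.24.1
    second = (0.25 + c * c) * n_m     # correction II.24.2
    return width * (base + first + second) / sum(freqs)


# Frequencies for the classes 15-19, 20-24, ..., 70-74 miles per hour.
freqs = [8, 6, 29, 63, 60, 74, 29, 14, 15, 2, 0, 0]
c = (38.2 - 37.0) / 5.0
md = mean_deviation_from_median(5.0, freqs, 4, c)
print(round(md, 1))  # 7.1
```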

Given N values. Choose a certain number as origin such that x
of the values will be greater than this number. Then N - x will
be less than the selected number. Let the sum of the numerical
values of the deviations from the selected number (average) as
origin be A. Displace the original origin by K units so that it is
exceeded by only x - 1 values. Then N - (x - 1) of the values will
be less than the new number. By this change, the sum of the
deviations in excess of the selected number is decreased by Kx,
while the sum of the deviations less than the selected number is
increased by (N - x) K. If A' is the new sum of deviations, then

$$A' = A + (N - x)K - Kx$$

and

$$A' = A + (N - 2x)K.$$

If $x = N/2$, $A' = A$.
If $x > N/2$, $A' < A$.
If $x < N/2$, $A' > A$.

The sum therefore decreases as long as more than half of the values
exceed the origin, and increases once fewer than half exceed it.
This proves that the sum of the numerical values of the deviations
from the median is a minimum.
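
The minimum property can also be checked numerically. The following
Python sketch uses a small set of hypothetical speeds (not taken from
the tables) and compares the sum of the numerical values of the
deviations about the median with the sums about a grid of other
origins. The function name and the grid are illustrative assumptions.

```python
from statistics import median


def sum_abs_dev(values, origin):
    """Sum of the numerical values |x - origin| of the deviations."""
    return sum(abs(x - origin) for x in values)


# Hypothetical speeds in miles per hour, for illustration only.
speeds = [22, 31, 33, 36, 38, 41, 44, 47, 58]
med = median(speeds)

# Candidate origins from 20 to 60 m.p.h. in half-mile steps.
candidates = [20 + 0.5 * k for k in range(81)]
best = min(candidates, key=lambda a: sum_abs_dev(speeds, a))

print("median:", med)
print("sum of |deviations| about the median:", sum_abs_dev(speeds, med))
print("origin on the grid with the smallest sum:", best)
```

With an odd number of distinct values the smallest sum occurs at a
single origin, the median itself; with an even number, any origin
between the two middle values gives the same minimum.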

II.25. Moments and Mathematical Expectation of Powers of a Variable.

The moments of a distribution are the expected values of the
powers of the stochastic variable which has the given distribution.
The term "moment" has been taken over by the statistician from
mechanics. In mechanics, moment is a measure of a force with
respect to its tendency to produce rotation. In statistics, moments
characterize the parameters of the distribution law, that is, the
properties that describe the law of behavior of the attribute that
is being measured and studied.



The late Karl Pearson (Biometrika, Vol. 9, pp. 1-10) has shown
that all the constants of a frequency distribution are expressible in
terms of higher product moments. In the case of two variates, they
are defined by

$$\nu_{q,q'} = \sum_i \sum_j p_{ij}\, x_i^{q}\, y_j^{q'} \tag{II.25.1}$$

for an arbitrary origin. If the origin is at the mean, namely, at
$P(\bar{x}, \bar{y})$, then

$$\mu_{q,q'} = \sum_i \sum_j p_{ij}\, (x_i - \bar{x})^{q}\, (y_j - \bar{y})^{q'} \tag{II.25.2}$$

In case of a single variable, the kth moment of a continuous
variable x about an arbitrary origin, denoted by $\nu_k$, is

$$\nu_k = E(x^k) = \int_a^b x^k f(x)\,dx \tag{II.25.3}$$

and in the case of a discontinuous variable x

$$\nu_k = E(x^k) = \sum_{i=1}^{n} p_i\, x_i^k. \tag{II.25.4}$$

As has been seen, the first moment about an arbitrary origin is
the probable or expected value and in case of a sample it is the
arithmetic mean of the x values.

The kth moment of the variable x about an arbitrary point a is
defined as

$$E[(x - a)^k] = \int_a^b (x - a)^k f(x)\,dx \tag{II.25.5}$$

or

$$E[(x - a)^k] = \sum_i (x_i - a)^k\, p_i. \tag{II.25.6}$$

If a is the arithmetic mean $\bar{x}$ of x and if $\mu_k$ is the symbol for the
kth moment about the mean, then

$$\mu_k = E[(x - \bar{x})^k] = E[(x - \nu_1)^k] = \int_a^b (x - \nu_1)^k f(x)\,dx \tag{II.25.7}$$

or

$$\mu_k = E[(x - \nu_1)^k] = \sum_i p_i\, (x_i - \nu_1)^k. \tag{II.25.8}$$
It is not hard to see that $\mu_2 = \sigma^2$.

It is easy to show that the moments about the mean can be
expressed in terms of the moments about an arbitrary origin. These
relations are:

$$\mu_r = \sum_i p_i\, (x_i - \nu_1)^r = \int_a^b (x - \nu_1)^r f(x)\,dx \tag{II.25.9}$$

Specifically:

$$\begin{aligned}
\mu_2 &= \nu_2 - \nu_1^2 \\
\mu_3 &= \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3 \\
\mu_4 &= \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4 \\
\mu_r &= \sum_{i=0}^{r} \binom{r}{i} (-\nu_1)^i\, \nu_{r-i}
\end{aligned} \tag{II.25.10}$$

where $\nu_0 = 1$ and $\binom{r}{i} = \dfrac{r!}{i!\,(r - i)!}$, namely the
number of combinations of r things taken i at a time.

For a sample

$$\nu_r = \sum_{i=1}^{k} f_i X_i^r / n \tag{II.25.11}$$

and

$$\mu_r = \sum_{i=1}^{k} f_i (X_i - \bar{X})^r / n. \tag{II.25.12}$$

Now consider the translation $x' = X - X_0$, and if $\nu'_r$ is the
rth moment of $x'$, then

$$\nu'_r = \sum_{i=1}^{k} f_i (X_i - X_0)^r / n = \frac{\sum_i f_i\, (x'_i)^r}{n} \tag{II.25.13}$$

and similarly if $x'' = \dfrac{X - X_0}{c}$ and $\nu''_r$ is the rth moment of $x''$,

$$\nu'_r = \frac{\sum_{i=1}^{k} f_i\, (c\,x''_i)^r}{n} = \frac{c^r \sum_{i=1}^{k} f_i\, (x''_i)^r}{n} = c^r \nu''_r \tag{II.25.14}$$

Hence:

$$\mu_r = \sum_{i=0}^{r} \binom{r}{i} (-\nu'_1)^i\, \nu'_{r-i} \tag{II.25.15}$$

and

$$\mu_r = c^r \sum_{i=0}^{r} \binom{r}{i} (-\nu''_1)^i\, \nu''_{r-i}. \tag{II.25.16}$$
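
The conversion from moments about an arbitrary origin to moments
about the mean is easy to verify by machine. The following Python
sketch applies the general binomial form of II.25.10 to a small
hypothetical grouped sample and compares the result with the direct
definition II.25.12. The data and the function names are assumptions
made for the illustration.

```python
from math import comb


def raw_moment(xs, fs, r):
    """r-th moment about the origin: sum(f * x**r) / n, as in II.25.11."""
    n = sum(fs)
    return sum(f * x**r for x, f in zip(xs, fs)) / n


def central_from_raw(nu, r):
    """r-th moment about the mean from raw moments nu[0..r] (nu[0] = 1)."""
    return sum(comb(r, i) * (-nu[1]) ** i * nu[r - i] for i in range(r + 1))


def central_direct(xs, fs, r):
    """r-th moment about the mean computed directly, as in II.25.12."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(xs, fs)) / n


# Hypothetical grouped sample: values xs observed with frequencies fs.
xs = [1, 2, 3, 4, 6]
fs = [2, 5, 7, 4, 2]
nu = [raw_moment(xs, fs, r) for r in range(5)]

for r in (2, 3, 4):
    print(r, central_from_raw(nu, r), central_direct(xs, fs, r))
```

The two columns printed for each r should agree except for rounding
in the last decimal places.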


To illustrate: Consider the distribution of Table II.1 and find
the first four moments about the mean using II.25.10 and II.25.16.

Table II.8.

SPEED IN MILES PER HOUR OF FREE MOVING VEHICLES ON SEPTEMBER 16,
1939, IN OAKLAWN, ILLINOIS, ON U.S.H. 12 AND 20 AT A POINT ONE MILE
EAST OF HARLEM AVENUE

S          f      s      fs      fs²      fs³      fs⁴

70-74      0      6       0        0        0        0
65-69      0      5       0        0        0        0
60-64      2      4       8       32      128      512
55-59     15      3      45      135      405     1215
50-54     14      2      28       56      112      224
45-49     29      1      29       29       29       29
40-44     74      0       0        0        0        0
35-39     60     -1     -60       60      -60       60
30-34     63     -2    -126      252     -504     1008
25-29     29     -3     -87      261     -783     2349
20-24      6     -4     -24       96     -384     1536
15-19      8     -5     -40      200    -1000     5000

         300 = n       -227     1121    -2057    11933

From Table II.8.

$$\nu''_1 = \frac{-227}{300} = -0.75667$$

$$\nu''_2 = \frac{1121}{300} = 3.73667$$

$$\nu''_3 = \frac{-2057}{300} = -6.85667$$

$$\nu''_4 = \frac{11933}{300} = 39.77667$$

Hence from II.25.10 and II.25.16, it is found that

$$\mu_0 = 1, \qquad \mu_1 = 0,$$

$$\mu_2 = c^2\,(\nu''_2 - \nu''^{\,2}_1) = 25\,(3.73667 - .57255) = 79.1$$

$$\mu_3 = c^3\,(\nu''_3 - 3\nu''_1\nu''_2 + 2\nu''^{\,3}_1) = 125\,[-6.85667 - 3(-0.75667)(3.73667) + 2(-0.75667)^3] = 94.9$$

$$\mu_4 = c^4\,(\nu''_4 - 4\nu''_1\nu''_3 + 6\nu''^{\,2}_1\nu''_2 - 3\nu''^{\,4}_1) = 625\,[39.77667 - 4(-0.75667)(-6.85667) + 6(0.75667)^2(3.73667) - 3(-0.75667)^4] = 19298.0$$

It is also useful to find

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{9006.01}{494913.67} = 0.018 \tag{II.25.17}$$

and

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{19298.0}{6256.81} = 3.08. \tag{II.25.18}$$

$\beta_1$ is an index of skewness and is useful to compare the intensity
of the departure from symmetry of a distribution with another
distribution. If the distribution is symmetrical, $\beta_1$ has the value
zero.

$\beta_2$ is an index of kurtosis (flatness) and is sometimes used to
determine whether a given distribution is more flat or less flat than
a corresponding "normal" distribution, for which $\beta_2 = 3$.

$\beta_1$ and $\beta_2$ are useful for determining which curve of a set of
curves is indicated by the data as a useful law of behavior. The
theory attached to these concepts was developed by the late Karl
Pearson and will be discussed briefly in Chapter III.
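
The arithmetic of this illustration can be retraced mechanically. The
Python sketch below assumes the frequencies and class deviations
exactly as listed in Table II.8, with a class interval of c = 5 miles
per hour, and recomputes the column totals, the moments about the
arbitrary origin, the moments about the mean from II.25.10 and
II.25.16, and the two indices. The variable names are illustrative
only.

```python
# Class deviations s (in class intervals, measured from the 40-44 class)
# and frequencies f, copied from Table II.8.
s = [6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
f = [0, 0, 2, 15, 14, 29, 74, 60, 63, 29, 6, 8]
c = 5  # class interval in miles per hour

n = sum(f)
# Column totals: sum of f * s**r for r = 0, 1, 2, 3, 4.
sums = [sum(fi * si**r for si, fi in zip(s, f)) for r in range(5)]
nu = [t / n for t in sums]  # moments about the arbitrary origin, class units

# Moments about the mean, converted to miles-per-hour units (II.25.16).
mu2 = c**2 * (nu[2] - nu[1] ** 2)
mu3 = c**3 * (nu[3] - 3 * nu[1] * nu[2] + 2 * nu[1] ** 3)
mu4 = c**4 * (nu[4] - 4 * nu[1] * nu[3] + 6 * nu[1] ** 2 * nu[2] - 3 * nu[1] ** 4)

beta1 = mu3**2 / mu2**3  # index of skewness
beta2 = mu4 / mu2**2     # index of kurtosis

print("n =", n, " column totals:", sums[1:])
print("mu2 = %.1f  mu3 = %.1f  mu4 = %.1f" % (mu2, mu3, mu4))
print("beta1 = %.3f  beta2 = %.2f" % (beta1, beta2))
```

Because both indices are ratios of moments of the same total degree,
they do not depend on the class interval c; the same values result if
the moments are left in class units.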

II.26. Relation Between Means. For positive numbers,

$$x_1 \le x_2 \le \dots \le x_n,$$

$$x_1 \le \text{H.M.} \le \text{G.M.} \le \text{A.M.} \le \text{R.M.S.} \le \text{C.H.M.} \le x_n.$$
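
A short numerical check of this chain may be helpful. The Python
sketch below computes the five means for a hypothetical set of speeds.
C.H.M. is taken here to be the contraharmonic mean, the sum of the
squares divided by the sum of the values; that reading of the
abbreviation is an assumption, as is the function name.

```python
from math import prod, sqrt


def means(xs):
    """Return (H.M., G.M., A.M., R.M.S., C.H.M.) for positive numbers xs."""
    n = len(xs)
    hm = n / sum(1 / x for x in xs)              # harmonic mean
    gm = prod(xs) ** (1 / n)                     # geometric mean
    am = sum(xs) / n                             # arithmetic mean
    rms = sqrt(sum(x * x for x in xs) / n)       # root mean square
    chm = sum(x * x for x in xs) / sum(xs)       # contraharmonic mean (assumed)
    return hm, gm, am, rms, chm


# Hypothetical speeds in miles per hour.
speeds = [20, 30, 40, 60]
labels = ("H.M.", "G.M.", "A.M.", "R.M.S.", "C.H.M.")
for label, value in zip(labels, means(speeds)):
    print(label, round(value, 3))
```

The values printed should increase from one line to the next and lie
between the smallest and largest observation.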

II.27. Desirable Properties of An Average.

(a) An average should be precisely defined.

(b) An average should be based on all observations.

(c) An average should possess some simple and obvious properties
to render its general nature comprehensible: it should
not be too abstract in mathematical characterization.

(d) An average should be capable of easy and rapid calculation.

(e) It should be as little affected as may be possible by
fluctuations of sampling or by sampling errors.

(f) The measure chosen should lend itself to algebraic treatment
and its basis should be concordant with the basis of the
problems to be analyzed.

These properties, applied to the arithmetic mean, median, mode,
geometric mean, and harmonic mean, are:

I. Arithmetic Mean. The A.M. satisfies a, b, c, d, e, f. The
arithmetic mean has the following properties.

(a) The sum of the deviations from the mean, taken with their
proper signs, is zero.

(b) The mean of a whole series can be readily expressed in terms
of the means of its components.

(c) The mean of all the sums or differences of corresponding
observations in two series (of equal numbers of observations)
is equal to the sum or difference of the means of the two
series.

(d) The sum of squares of the deviations from the arithmetic
mean is a minimum.

II. Median. The median satisfies (b) and (c) but the definition
does not necessarily lead in all cases to a determinate result. The
median is easier to compute than the arithmetic mean. The
arithmetic mean is superior to the median in lending itself to
algebraic treatment. No theorem exists for the median similar to
properties (b) and (c) of the arithmetic mean. The median has the
following advantages over the mean:

(a) It is very readily calculated: a factor to which, however, as
already stated, too much weight ought not to be attached.

(b) It is readily obtained without the necessity of measuring all
the objects to be observed.

(c) The sum of the deviations from the median, all taken as
positive, is a minimum.

III. Mode. What we want to arrive at is the mid-value of the
interval for which the frequency would be a maximum, if the intervals
could be made indefinitely small and at the same time their
number be so increased that the class frequency would run
smoothly. A smoothing process is necessary; viz. that of fitting
an ideal frequency curve of given equation to actual figures.

IV. Geometric Mean. The geometric mean is used in averaging
rates or ratios rather than quantities.

(a) If the ratios of the geometric average to the measures it
exceeds or equals be multiplied together, the product will be
equal to the product of the ratios of the geometric average
to those measures which exceed it in value.

If $x_1 \le x_2 \le x_3 \le \dots \le x_k \le \text{G.M.} < x_{k+1} \le x_{k+2} \le \dots \le x_n$, then

$$\frac{G}{x_1} \cdot \frac{G}{x_2} \cdots \frac{G}{x_k} = \frac{x_{k+1}}{G} \cdot \frac{x_{k+2}}{G} \cdots \frac{x_n}{G}. \tag{II.27.1}$$

(b) The geometric average of the ratios of corresponding
observations in two series is equal to the ratio of their geometric
averages.

(c) The geometric average of the series formed by combining n
different series each with the same frequency is the
geometric average of the geometric averages of the separate
series.
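
Property (a), relation II.27.1, can be illustrated with a few lines of
Python. The measures below are hypothetical and were chosen only so
that the geometric mean is easy to recognize.

```python
from math import prod

# Hypothetical positive measures.
xs = [2, 4, 8, 16, 32]
G = prod(xs) ** (1 / len(xs))  # geometric mean

# Left side of II.27.1: ratios G/x for the measures G exceeds or equals.
left = prod(G / x for x in xs if x <= G)
# Right side of II.27.1: ratios x/G for the measures that exceed G.
right = prod(x / G for x in xs if x > G)

print("G =", G)
print("left product =", left, " right product =", right)
```

The two products agree because the product of all the measures equals
G raised to the power n, which is simply the definition of the
geometric mean rearranged.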

V. Harmonic Mean. The harmonic average of a set of measurements
must be used in the averaging of time rates.
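
A familiar case of a time rate is speed over a fixed distance. The
Python sketch below, with hypothetical figures, supposes a vehicle
covers two successive one-mile stretches at 30 and 60 miles per hour;
the over-all speed, total distance divided by total time, agrees with
the harmonic mean of the two speeds and not with their arithmetic
mean.

```python
def harmonic_mean(rates):
    """Harmonic mean: number of rates divided by the sum of reciprocals."""
    return len(rates) / sum(1 / r for r in rates)


# Hypothetical speeds (m.p.h.) over two successive stretches of equal length.
speeds = [30, 60]
stretch_miles = 1.0

total_time = sum(stretch_miles / v for v in speeds)     # hours
overall = stretch_miles * len(speeds) / total_time      # miles per hour
arithmetic = sum(speeds) / len(speeds)

print("over-all speed:", overall)
print("harmonic mean:", harmonic_mean(speeds))
print("arithmetic mean:", arithmetic)
```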

Having shown the initial procedure necessary for a statistical
analysis, namely, how to summarize data and how to obtain summary
numbers for the purpose of characterizing the law of behavior of
the observed facts, we shall now develop the necessary theory that
is basic for the analysis and solution of traffic problems.

CHAPTER III

STANDARD  DISTRIBUTIONS 
AND  THEIR  MATHEMATICAL  PATTERNS 

III.1. Objective. The purpose of this chapter is to explain the related
problems of first ascertaining the nature of a universe of events and
second finding a mathematical model or pattern that fits the
universe. From experience and intuition, we know that a sample
will tell us something about the entire series of events, and that
the larger the sample the more accurately it reflects the
characteristics of the parent universe. We reason that a mathematical
model of the sample, if the sample is large, will also be a model of
the universe. Obviously, this fitting of mathematical patterns will be
much easier if we know something about the types of universes or
distributions of events we may expect to find.

There are three of these theoretical distributions that constitute
the basic patterns. They are, in the order of their discovery, the
Binomial (James Bernoulli about 1700), the Normal (De Moivre
about 1700, Laplace and Gauss about 1800), and the Poisson (S. D.
Poisson about 1837). Other distribution patterns have been
discussed by Gram (1879), Fechner (1897), Thiele (1900), Edgeworth
(1904), Charlier (1905), Brun (1906), Romanowsky (1924), and
others. These are in general either other approaches to,
modifications of, or generalizations of the three basic distributions.
The most logical order in which to present these from the standpoint
of clearness is also the historical order of appearance. But before
considering the first of these, the Binomial distribution, we shall
discuss the elements that make up a distribution.

III.2. The Elements of a Distribution. In order to define and to point
out the interrelationships of the elements that make up a
distribution, let us consider a trial like the throwing of a die. The
result will be the happening or non-happening of a specific event such
as the falling of the die with one spot on the top face.

An event, of course, can be the occurrence of any attribute or
characteristic as well as a happening. In traffic, for example, it
could be the age of a driver, his seeing ability, the life of an
automobile tire, the weight class of a truck, the volume of traffic,
the speed of a vehicle, or any one of many other things.

The happening of a specific thing is called the Event $E$, and the
non-happening is called the complementary event $\bar{E}$. If the die is
thrown a limited number of times (number of trials), we get a
sample distribution of $E$'s and $\bar{E}$'s. If the number of trials is
increased without limit, the observed sample distribution approaches
the true or theoretical distribution of the universe or total
population of the events.

There  are  thus  two  kinds  of  distributions:  (a)  the  theoretical 
and  (b)  the  experimental  or  sample  distribution. 

The Theoretical Distribution: In order to explain the theoretical
distribution, let $f_t$ be the number of ways in which the event $E$ can
take place, $f_c$ the number of ways for the complementary event $\bar{E}$,
and n the total number of trials or happenings and non-happenings.

The probability that the event $E$ will occur is the ratio of the
number of ways $f_t$ in which $E$ can happen to the total number of
possible and equally likely happenings and non-happenings. Let
p or $P(E)$ be this probability, then symbolically

$$p = P(E) = f_t/n \tag{III.2.1}$$

Similarly, the total number of ways $f_c$ in which the event $\bar{E}$ can
happen divided by n is defined as the probability (a-priori, true, or
theoretical) that the event $\bar{E}$ will occur. Let q or $P(\bar{E})$ be this
probability, then symbolically

$$q = P(\bar{E}) = f_c/n = \frac{n - f_t}{n} = 1 - \frac{f_t}{n}. \tag{III.2.2}$$

In the case of a die, if $E$ is the event of the die's falling with one
spot on the top face and $\bar{E}$ is the event of the die's falling some
other way, then $f_t = 1$, $f_c = 5$, $n = 6$ and

$$p = P(E) = \tfrac{1}{6}; \qquad q = P(\bar{E}) = \tfrac{5}{6}; \qquad \text{and} \qquad p + q = \tfrac{1}{6} + \tfrac{5}{6} = 1.$$

Again if n is the total number of registered vehicles and $f_t$ is the
number of light trucks, then

$$p = P(E) = \frac{f_t}{n}$$

is the true probability that a vehicle is a light truck.

In general, let a be the number of times the event $E$ occurs, and
let b be the number of times the event $\bar{E}$ occurs, these being the
only possibilities. Then p = a/(a + b) is the probability that the
event happens as specified (event $E$), and q = b/(a + b) is the
probability that the event does not occur (event $\bar{E}$). It follows that
p + q = 1, which simply demonstrates what we know intuitively,
that an event is certain to happen or not to happen. This also
shows that neither p nor q can be negative or greater than 1. This is
the Fundamental additive property in probability. This property is
also referred to in the literature as the Rule of Complementation.

Let us now suppose that one tosses a penny twice and wishes to
find the probability of getting two heads. One might reason falsely
that there are three possibilities: two heads, two tails, or one head
and one tail. One of these outcomes is two heads, therefore, one
might reason that the probability is 1/3, but this reasoning is false,
for the events are not equally likely. The third event may occur in
two ways, for a head could appear on the first trial and the tail on
the second, or the head could appear on the second and the tail on
the first. There are really four equally likely outcomes or phases:
HH, HT, TH, TT; and the correct probability is therefore 1/4. The
two tosses are independent, and the four outcomes are mutually
exclusive. If two heads are up, that is the only possible combination,
for if a penny is heads up, it obviously cannot at the same time be
tails up. This mutual exclusiveness does not always exist. Suppose
that one wishes to compute the probability of drawing a king or a
heart from a deck of cards. The chances might be assumed to be
17/52 since there are 4 kings and 13 hearts. But this is incorrect, for
the drawing of a king does not exclude drawing of a heart. The king
may also be a heart, so that only 4 + 13 - 1 = 16 different cards are
favorable and the correct probability is 16/52.
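
Both counts can be confirmed by listing the equally likely cases. The
Python sketch below enumerates the four outcomes of two tosses and the
52 cards of a standard deck; the labels used for faces, ranks, and
suits are arbitrary choices made for the illustration.

```python
from fractions import Fraction
from itertools import product

# Two tosses of a penny: every ordered pair of faces is equally likely.
tosses = list(product("HT", repeat=2))
two_heads = [t for t in tosses if t == ("H", "H")]
p_two_heads = Fraction(len(two_heads), len(tosses))

# A standard deck: 13 ranks in each of 4 suits.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# "King or heart" is not a pair of mutually exclusive events:
# the king of hearts must be counted only once.
favorable = [card for card in deck if card[0] == "K" or card[1] == "hearts"]
p_king_or_heart = Fraction(len(favorable), len(deck))

print("outcomes of two tosses:", tosses)
print("P(two heads) =", p_two_heads)
print("favorable cards:", len(favorable), "of", len(deck))
print("P(king or heart) =", p_king_or_heart)
```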

The Experimental Distribution: The experimental or sample
distribution is obtained from a number of observations of events.
Let $f_0$ be the number of times the event $E$ is observed to happen
and n the total number of trials or observations. The ratio $f_0/n$ is
called the relative frequency of the event $E$ and $1 - f_0/n$ is the
relative frequency of the event $\bar{E}$.

The obtaining of the numerical values of the relative frequencies
$f_0/n$ is actually a very simple problem since it is essentially a
problem of counting. The value of $f_0/n$, in contrast to the true
probability, varies with the number of observations or trials n.
One might count all the traffic violations that occurred at an
intersection during the passing of 5000 vehicles and find that there
were no violations. In this situation, the observed $f_0 = 0$, n = 5000
and $f_0/n = 0/5000 = 0$. But if the violations occurring
during the passing of 25000 vehicles were counted, it might be
found that there were 4 violations, and now the observed $f_0 = 4$,
n = 25000, and $f_0/n = 4/25000$. Actually, we need to know the
probable or expected value of such observed relative frequencies,
$f_0/n$. This is defined as the true probability p that the event $E$ will
occur and it is the limit that $f_0/n$ approaches as the number of
trials (observations) is indefinitely increased. Expressed
symbolically, if $E(f_0/n)$ is the symbol for the probable or expected
value of an observed relative frequency $f_0/n$, then

$$E\left(\frac{f_0}{n}\right) = \lim_{n \to \infty} \left(\frac{f_0}{n}\right) = p = P(E) \tag{III.2.3}$$

It should be noted that in actual cases n need not be infinite to give
a practical result. It is, however, necessary that n is not small.
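
The way in which the relative frequency settles toward the true
probability can be imitated with random numbers. The Python sketch
below is a simulation under stated assumptions, not an observation of
traffic: it takes the event to be a die falling with one spot up
(p = 1/6) and uses a pseudo-random generator with an arbitrary fixed
seed so that a run can be repeated.

```python
import random


def relative_frequency(p, n, rng):
    """Observed relative frequency f0/n of an event of probability p in n trials."""
    f0 = sum(1 for _ in range(n) if rng.random() < p)
    return f0 / n


rng = random.Random(12345)  # arbitrary seed, fixed for repeatability
p = 1 / 6                   # die falling with one spot on the top face

for n in (10, 100, 1000, 10000, 100000):
    freq = relative_frequency(p, n, rng)
    print(n, round(freq, 5), "difference from p:", round(abs(freq - p), 5))
```

For small n the difference from p may be large and irregular; it tends
to shrink as n grows, which is the behavior the limit III.2.3
describes.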

The discussion just given may be summarized with two
definitions:

Definition 1. If an event $E$ can happen in $f_t$ cases out of a total
of n possible cases which are all considered by mutual agreement
to be equally likely, then the probability $p = P(E)$ that the event
$E$ will occur is defined to be $f_t/n$. Symbolically, $p = P(E) = f_t/n$.

Definition 2. If a series of many observations or trials is made,
and if the ratio of the number of times, $f_0$, the event $E$ occurs, to
the total number of observations, n, namely, $f_0/n$, approaches
nearer and nearer to a definite number, $p = P(E)$, as larger and
larger sets of trials or observations are made, then the probability
of $E$ is defined to be p. Expressed symbolically,

$$\lim_{n \to \infty} \left(\frac{f_0}{n}\right) = p = P(E)$$

An important question yet to be answered is: How much in
error is $f_0/n$ from p for a given number of observations and how
certain are we that this error is not exceeded? In other words, for
a given degree of certainty, how large a sample of observations
must be made to guarantee that a specified error will not be
exceeded?

This question is answered by the fundamental theorems of
Bernoulli¹ and Cantelli² and by the Bienaymé-Tchebycheff
criterion,³ which will be stated without proof.

III.3. Bernoulli's Theorem.¹ Bernoulli found that there is a definite
number of observations that will give a certain assurance that a
given error will not be exceeded. His finding is based upon a
natural law which may be demonstrated by the tossing of a penny.
If the penny is not defective, the probability p of getting a head is
1/2. Let us now assume 4 heads have been obtained in 10 tosses.
This relative frequency ($f_0/n$) or 4/10 is in error from the true or
theoretical probability p of 1/2 by 0.1. Let us next assume that we
have tossed the penny 100 times and obtained 51 heads. The
relative frequency 51/100 is now in error by only 0.01. With more
tosses there would be a tendency toward a further decrease in error
which would lead us to suspect that something may be known about
the number of trials that are necessary in order to get from
observations a probability that will differ from the theoretical
probability p by less than an arbitrarily assigned positive quantity
$\epsilon$, known as the experimental error.

The next question to be answered is how certain are we that the
error will not be more than $\epsilon$. The measure of our confidence that
$\epsilon$ is the maximum error is indicated by attaching a probability to
$\epsilon$. This probability is dependent upon the number of trials n.

The probability $\eta$ that $\epsilon$ is not the maximum error is the
complement of the probability that $\epsilon$ is the maximum error. This
probability, $\eta$, is the measure of our lack of confidence that $\epsilon$ is
not exceeded and is called the level of significance. If $\eta$ is the level
of significance, then $1 - \eta$ is the measure of our confidence or
ability to prove that $\epsilon$ is not exceeded. The number $\eta$ (eta) is also
sometimes called the risk. In common parlance, if we are 75 per cent
certain of our result, we are 25 per cent uncertain, or in other
words, the risk is 25 per cent.

If we wished to find the size of sample necessary to give us a
99 per cent guarantee that the relative frequency ($f_0/n$) obtained
would differ from the theoretical probability p for the universe by
not more than 0.03, $\epsilon$ would be 0.03 and $\eta$ would be 0.01. The value
of 0.01 for $\eta$ would mean that 1 per cent of the time it would be
impossible to explain the difference between the observed and the
theoretical frequency other than that it just happened. In other
words, it would mean that the odds are 99 to 1 in favor of finding at
least one real reason for the existence of the difference other than
that it was merely accidental.

Having examined the underlying theory of Bernoulli's theorem,
we will now state it more rigorously: For any arbitrarily given
$\epsilon > 0$ and $0 < \eta < 1$ there exists a number of trials $n_0$ dependent
upon both $\epsilon$ and $\eta$ [symbolically $n_0(\epsilon, \eta)$] such that for any single
value of $n > n_0(\epsilon, \eta)$, the probability that the observed relative
frequency ($f_0/n$) of an event $E$ in a series of n independent trials with
constant probability p will differ from this probability p by less than
$\epsilon$ will be greater than $1 - \eta$.

Symbolically, this is written

$$P\{\,|f_0/n - p| < \epsilon\,\} > 1 - \eta \quad \text{for } n > n_0 \tag{III.3.1}$$

The $n > n_0$ in Bernoulli's theorem is given by the following
inequality:

$$n > n_0 \ge \frac{1 + \epsilon}{\epsilon^2} \log_e \frac{1}{\eta} + \frac{1}{\epsilon} \tag{III.3.2}$$

Example 1. Given $\epsilon = 0.01$ and $\eta = 0.01$. Substituting these given
values in the inequality III.3.2, we get

$$n > n_0 = \frac{1.01}{(.01)^2} \log_e \frac{1}{0.01} + \frac{1}{0.01}, \quad \text{whence } n > n_0 = 46613.$$

In this example, $n_0 = 46613$. However, n is any single number
greater than 46613.

Example 2. Given $\epsilon = 0.01$ and $\eta = 0.05$. Substituting these
given values in the inequality III.3.2, we find that

$$n > n_0 = \frac{1.01}{(.01)^2} \log_e \frac{1}{0.05} + \frac{1}{0.01}, \quad \text{whence } n > n_0 = 30357.$$

Hence $n_0 = 30357$ and n is any single number greater than 30357.
A comparison of the results of the two examples shows that
reducing the certainty from 99 per cent to 95 per cent reduced the
size of the sample required from 46614 to 30358.

Increasing the allowable experimental error will also decrease
the size of the sample required.

Example 3. Given $\epsilon = 0.05$ and $\eta = 0.05$. Substituting these
given values in III.3.2, it is found that

$$n > n_0 = \frac{1.05}{(.05)^2} \log_e \frac{1}{0.05} + \frac{1}{0.05}, \quad \text{whence } n > n_0 = 1278.$$

Under the conditions, n is any single number greater than 1278.

The result of Example 3 means that if a random set of 1279
observations is taken, we are 95 per cent certain that the true
probability p for the occurrence of the event $E$ will be between the
values $f_0/n - 0.05$ and $f_0/n + 0.05$. This may be expressed
symbolically as

$$P\{\,|f_0/n - p| < 0.05\,\} > 0.95$$

for any single n > 1278. There are similar interpretations for
Examples 1 and 2.
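
The right-hand side of III.3.2 is simple to evaluate by machine. The
Python sketch below does so for the three pairs of values used in the
examples; the function name is an illustrative assumption, and natural
logarithms are used throughout.

```python
from math import log


def bernoulli_n0(eps, eta):
    """Right-hand side of III.3.2: (1 + eps)/eps**2 * log_e(1/eta) + 1/eps."""
    return (1 + eps) / eps**2 * log(1 / eta) + 1 / eps


# (experimental error, level of significance) as in Examples 1, 2, and 3.
cases = [(0.01, 0.01), (0.01, 0.05), (0.05, 0.05)]
for eps, eta in cases:
    print("eps =", eps, " eta =", eta, " bound =", round(bernoulli_n0(eps, eta), 1))
```

The bounds printed agree with the values of $n_0$ quoted in the three
examples to within about one observation. Note how strongly the bound
depends on the error: relaxing $\epsilon$ from 0.01 to 0.05 at the same $\eta$
cuts the required sample by a factor of roughly 24.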

An examination of Bernoulli's theorem shows that the number
of observations necessary for a given result is totally independent
of the true probability p and hence is independent of the
theoretical distribution law. In other words, without knowing anything
about the nature of the law of behavior, it is possible to determine
the sample size for a specified accuracy and certainty. If, however,
we have some knowledge of the law of behavior, which is the case
in nearly all practical applications, the size of the sample will be
much smaller than indicated in Examples 1, 2, and 3, sometimes
even less than 100. This will be made more apparent in later
discussions.

For the sake of clarity, let us summarize the various aspects of
Bernoulli's theorem. This theorem is based upon the law that as n
increases, the measure of uncertainty $\eta$ decreases. It enables us to
find, for a fixed error $\epsilon$ and measure of uncertainty $\eta$, the size of a
single n. This being the case, it is now possible to learn how large
n must be so that, for an assigned error $\epsilon$, the sum of all the
decreasing measures of risk (the $\eta$'s) for all N's larger than n is
less than a selected $\eta$. It follows, of course, that if the sum of the
risks in question is less than $\eta$, then any one of them is less than $\eta$.

More precisely: Instead of there being any single $n > n_0$, for
a given $\epsilon$ and $\eta$ there is a number of trials, N, which is such that
the sum of the risks for all n's > N is at most $\eta$. The number N is
found by Cantelli's theorem.

III.4. Cantelli's Theorem.² For a given $\epsilon < 1$, $\eta < 1$, let $n > N(\epsilon, \eta)$
be an integer satisfying the inequality:

$$n > \frac{2}{\epsilon^2} \log_e \frac{2}{\eta} + 2. \tag{III.4.1}$$

With the value of n given by the inequality, the probability that the
observed relative frequency ($f_0/n$) of an event $E$ will differ from the
actual theoretical probability p by less than $\epsilon$ in the nth and all the
following trials is greater than $1 - \eta$.

Thus Cantelli's theorem, as noted above, gives the probability
that $|f_0/n - p| < \epsilon$ for all the n's from $N(\epsilon, \eta)$ onward, namely for
n = N, N + 1, N + 2, . . . The complementary probability is the
probability that at least one of the inequalities $|f_0/n - p| \ge \epsilon$ is
true, where n may be equal to either N, or N + 1, or N + 2, . . .
Since these different possibilities form a set of mutually exclusive
events it follows that the probability that at least one of the events
has occurred is the sum of the probabilities that that one and all
the following events have occurred.

Now, if $Q$ ($Q \le \eta$) is the probability of this complementary
event, then it is the probability that the experimental error is at
least $\epsilon$ in the nth or in at least one of the following trials.

If we know or specify any two of the quantities n, $\epsilon$, $\eta$, the
third may be found in terms of Bernoulli's theorem (III.3.2) or
Cantelli's theorem (III.4.1).

Since the requirement that the experimental error be at most $\epsilon$
in all the numbers of trials greater than N is more demanding than
the requirement that it be at most $\epsilon$ in any single number of trials
greater than a given number $n_0$, we would expect, as is the case,
that more trials are necessary for the situation covered by the
Cantelli theorem than are necessary for the Bernoulli theorem.

It is important to note that in both Cantelli's and Bernoulli's
theorems, the number of trials necessary is independent of the
probability p that the event will happen as specified and hence
is independent of the distribution law. In other words, the results
are true as long as we are sure that the event will happen or will
not happen, or speaking mathematically, so long as it is true that
p + q = 1 where q is the probability that the event will not
happen as specified.

If the value of p is known, which is the same as saying that we
know the distribution law, and n is also dependent on p, then, in
general, the number of trials found from theorems III.3.2 and
III.4.1 is much too large. This fact will be demonstrated later.

Example 1. Letting ε = 0.01 and η = 0.01 as in example 1
above and substituting these given values in the inequality III.4.1.,

    n > \frac{2}{\varepsilon^2} \log_e \frac{2}{\eta} + 2 = \frac{2}{(0.01)^2} \log_e \frac{2}{0.01} + 2,   whence

    n > 152,021.

In this example, N = n + 1 = 152,022. Therefore in the
152,022nd trial and all the following trials (and hence in at least
one) we are assured that the observed relative frequency (f₀/n) will
differ from the theoretical probability p by at most 0.01, and it
is (1 - η) = 0.99, or 99 per cent, certain that this is true
and only 1 per cent uncertain.

Example 2. Let ε = 0.01 and η = 0.05, then III.4.1. becomes

    n > \frac{2}{(0.01)^2} \log_e \frac{2}{0.05} + 2,   whence

    n > 119,832.

Example 3. Let, as in example 3 above, ε = 0.05 and η = 0.05.
In this case, III.4.1. becomes

    n > \frac{2}{(0.05)^2} \log_e \frac{2}{0.05} + 2,   whence   n > 4796.

The results of these examples when compared with the minimum
number of trials necessary when using Bernoulli's theorem show
that Cantelli's theorem requires more trials. This is because
Cantelli's theorem gives a value for all n's greater than N while
Bernoulli's theorem gives a value for any single n greater than n₀. In
either case, as the number of trials is increased, the probability
that the experimental error ε has a specified upper limit becomes
greater and greater, and η becomes smaller and smaller.

The  theorems  of  Bernoulli  and  Cantelli  are  based  upon  the  idea 
that  there  is  definite  probability  that  the  values  of  a  stochastic 
variable  will  fall  within  a  specified  range. 

Another  approach  is  to  find  the  probability  that  a  stochastic 
value  taken  at  random  will  differ  from  some  chosen  value  a  by  as 
much  as  a  specified  amount,  D.  This  probability  is  given  by  the 
Bienayme-Tchebycheff  Criterion.3 

III. 5. The Bienayme-Tchebycheff Criterion.3 This criterion is
independent of the form of distribution of given measurements and in
addition is independent of the origin. If X is the stochastic variable
which may assume the values X_i (i = 1, 2, ..., n), and if p_i (i =
1, 2, ..., n) are the corresponding probabilities, where Σ p_i = 1
and if a is any number (origin) from which the differences of the
X's are measured, then

    D^2 = E (X_i - a)^2 = \sum p_i x_i^2          III.5.1.

where x_i = X_i - a and D² is the expected value of the squares of
the differences of the X's from a.

Under these conditions, it is found that, if λ > 1,

    P (\lambda D) \le 1/\lambda^2          III.5.2.

This expression, wherein (λD) means λ times D and λ equals the
multiple of the differences D from the chosen number a, is the
Bienayme-Tchebycheff Criterion.

The criterion, to state it in words, says that the probability
P (λD) is not more than 1/λ² that a stochastic variable taken at
random will differ from some chosen number a by as much as
λ (λ > 1) times the value of D. A very useful special case is when
a is the probable or expected value.

Example 1. If the probability P (λD) = η ≤ .01 and ε = .01,
then for any a and p, λ must be √100 = 10. It will be seen later
that n must be greater than 250,000.

Example 2. If the probability P (λD) = η ≤ .05 and ε = .01,
then for any a and p, λ must be √20. In this case n > 50,000.

Example 3. If the probability P (λD) = η ≤ .05 and ε = .05,
then for any a and p, λ must be √20. In this case n > 2000.

These illustrations demonstrate that quite frequently the
experimenter gathers more data than are necessary for the accuracy
required. This makes the cost of the study unnecessarily large and
demonstrates a lack of efficiency as well as an approach that is
scientifically unsound.

If  we  have  a  limit  definition  of  probability,  Bernoulli's  theorem 
is  an  immediate  consequence  thereof.  In  case  we  have  any  defini- 
tion of  probability  p  for  the  event  E  happening  as  specified,  it  is 
possible  to  prove  Bernoulli's  theorem  by  the  use  of  the  Bienayme- 
Tchebycheff  criterion.  This  will  be  shown  later  in  this  chapter. 

In  general,  the  evaluation  of  the  probability  of  a  given  chance 
event  necessitates  the  enumeration  of  all  possible  outcomes.  These 
outcomes  as  shown  by  the  tossing  of  a  penny  or  the  drawing  of  a 
card  involve  combinations  and  arrangements  (permutations)  of 
happenings. 

III.  6.  Permutations  and  Combinations.  There  are  two  basic  prin- 
ciples in  combinations : 

1. If an event A can occur in a total of a ways and an event B
can occur in a total of b ways, then either A or B can occur in
a + b ways, provided they cannot occur at the same time.

2.  If  an  event  A  can  occur  in  a  total  of  a  ways  and  an  event  B 
can  occur  in  a  total  of  b  ways,  then  A  and  B  can  occur  to- 
gether in  a  •  b  ways. 

These two principles can be generalized to take account of any
number of events. Three mutually exclusive events A, B, or C can occur
in a + b + c ways and three events A, B, and C can occur
together in a · b · c ways.

These  ideas  may  be  illustrated  by  letting  A  represent  the  draw- 
ing of  a  heart  from  a  deck  of  cards  and  B  the  drawing  of  a  spade. 
Since  there  are  13  hearts,  there  are  13  ways  of  drawing  a  heart, 
and  likewise  for  spades.  The  number  of  ways  in  which  a  heart  or 
a  spade  can  be  drawn  is  13  +  13  =  26.  The  second  principle  is  also 
illustrated  by  the  drawing  of  a  heart  and  a  spade  together.  There 
are 13 · 13 = 169 ways of doing this, for with any one of the 13 hearts we
may  put  one  of  the  13  spades,  and  with  any  one  of  the  13  spades, 
we  may  put  one  of  the  13  hearts  and  so  on. 

A  more  general  illustration  of  the  second  principle  is  that  of  a 
room  in  which  there  are  n  seats  and  x  individuals  to  be  seated,  and 
where x < n. We wish to know in how many different ways (arrangements
or permutations) these x individuals may be seated in the
room.  To  find  out  we  may  proceed  as  follows :  Assume  that  all  the 
x  individuals  are  outside  the  room.  The  first  one  to  come  in  has  n 
choices.  He  seats  himself.  When  a  second  individual  comes  in,  he 
has  (n —  1)  choices,  or  one  choice  less  than  the  first  individual. 
For  the  third  individual  there  are  (n  —  2)  choices,  or  one  less  than 
for  the  second  person.  Hence,  there  are  n  (n  —  1)  (n  —  2)  choices 
(arrangements  or  permutations)  for  the  first  three.  This  illustra- 
tion brings  out  the  fact  that  permutations  have  to  do  with  single 
items  or  groups  of  items  treated  as  units  and  that  the  choice  for 
each  succeeding  individual  (item  or  group)  is  reduced  by  one. 

If we continue until all the x individuals are seated and if nPx
is the number of choices, then

    {}_nP_x = n (n - 1) (n - 2) (n - 3) \dots (n - x + 1)          III.6.1.

This expression may be shortened by multiplying it by

    \frac{(n - x)(n - x - 1)(n - x - 2) \dots 3 \cdot 2 \cdot 1}{(n - x)(n - x - 1)(n - x - 2) \dots 3 \cdot 2 \cdot 1} = \frac{(n - x)!}{(n - x)!}

It then becomes

    {}_nP_x = \frac{n!}{(n - x)!}          III.6.2.

In the case when x = n, III.6.1. becomes

    {}_nP_n = n (n - 1) (n - 2) (n - 3) \dots 3 \cdot 2 \cdot 1 = n!          III.6.3.

and this is the number of permutations (arrangements) of n things
taken n or all at a time.

Let  us  now  turn  to  the  question  of  how  many  different  combina- 
tions of  x  things  are  possible  if  n  things  are  available.  A  combina- 
tion is  an  unarranged  or  unordered  set  of  things,  while  a  permuta- 
tion is  an  arranged  or  ordered  set  of  things. 

Definition :  The  number  of  different  unordered  sets  of  x  (x  <  n) 
things  which  can  be  selected  from  a  set  of  n  things  is  called  the 
number  of  combinations  of  the  n  things  taken  x  at  a  time ;  and  is 
designated  by  the  symbol  nCx. 

To  find  nCx  it  is  only  necessary  to  keep  in  mind  that  we  may 
have  permutations  of  groups  (or  combinations)  as  well  as  of  in- 
dividuals. After  all  the  different  groups  have  been  obtained,  the 
individuals  in  each  group  may  be  arranged  to  give  the  total 
number  of  permutations. 

The number nPx is thus the number of ways we can make nCx
group choices followed by x! independent individual choices.
That is

    {}_nP_x = {}_nC_x \cdot x!

hence

    {}_nC_x = \frac{{}_nP_x}{x!} = \frac{n!}{(n - x)!\, x!}          III.6.4.

since from III.6.2.

    {}_nP_x = \frac{n!}{(n - x)!}

Example:  Let  us  find  (a)  the  number  of  permutations  and  (b) 
the  number  of  combinations  of  15  things  taken  3  at  a  time. 

(a) From III.6.1., 15P3 = 15 · 14 · 13 = 2730

(b) From III.6.4., 15C3 = 15! / (3! 12!) = 455.
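The two counts in this example can be checked with a short sketch. The function names below are illustrative only and do not come from the text; the first follows the falling product of III.6.1. and the second divides by x! as in III.6.4.

```python
import math

def permutations_count(n, x):
    """Ordered arrangements of n things taken x at a time (III.6.1.)."""
    count = 1
    for k in range(x):
        count *= n - k          # n (n-1) (n-2) ... (n-x+1)
    return count

def combinations_count(n, x):
    """Unordered selections of n things taken x at a time (III.6.4.)."""
    return permutations_count(n, x) // math.factorial(x)

print(permutations_count(15, 3))   # 2730
print(combinations_count(15, 3))   # 455
```

The standard library functions math.perm and math.comb return the same values and may be preferred in practice.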

Until  now  we  have  dealt  with  the  simple  probability  of  whether 
a  single  event  would  happen  or  would  not  happen.  But  we  are  also 
interested  in  finding  the  probability  that  two  or  more  events  will 
occur  together. 

For an illustration of a compound event, we may toss two
pennies. The number of ways in which two pennies may lie are:
HH, HT, TH, TT. The probability of two pennies falling heads up
is thus 1/4. Now we recall that the probability of one penny falling
heads up is 1/2 and that 1/2 · 1/2 = 1/4. This indicates that the
probability of the compound event, two pennies falling heads up, is
under certain conditions the product of the probabilities of the
two separate events, each event being a penny falling heads up.
This is precisely what the situation is if the separate events are
independent.

If  it  is  kept  in  mind  that  for  every  event  there  is  a  corresponding 
probability  p,  then  the  theorem  of  compound  probability  follows 
immediately  from  basic  principle  number  two  in  article  III. 6. 

III. 7. Theorem of Compound Probability. If the probability that an
event will occur is p1 and if after this event has occurred the probability
that a second event will occur is p2 then the probability that both
events will occur in the order stated, is p1 · p2.

If  the  events  are  independent,  as  in  the  case  of  the  pennies,  it  is 
not  necessary  that  they  happen  in  any  definite  order.  The  com- 
bination a  "head  and  a  tail"  is  the  same  as  a  "tail  and  a  head". 

Corollary:  If  the  separate  elementary  events  are  independent, 
the  probability  of  the  compound  event  is  the  product  of  the 
probabilities  of  the  separate  events. 

If there are x independent events and if p is the probability of
the occurrence of each independent event, the probability that
the event will occur x times in x trials is p^x. If in n trials q is the
probability that the event does not occur, and if x (x < n) is the
number of times the event occurs, then n - x is the number of
times the event does not occur. Clearly, if p^x is the probability
that the event will occur x times as specified, q^(n-x) is the
probability that it will not occur the remaining (n - x) times. Hence the
combined probability that in n trials a specific x of the n events
will occur as specified is

    p(x) = p^x q^{n-x}          III.7.1.

This theorem applies to a set of events as well as to a single event
for the probability for the occurrence of any specific set of x
events is the same as the probability for any other set of x events.
Consequently, the probability of the event's occurring exactly x
times without the restriction of its being a specific x is equal to the
product of the probability for any specific x occurrences by the
number of combinations of x sets there are in n events. This value
has been shown to be (III.6.4.) equal to

    {}_nC_x = \frac{n!}{x!\,(n - x)!}

Hence, the probability P (x) of the event's occurring exactly x
times in n trials is

    P(x) = \frac{n!}{x!\,(n - x)!} p^x q^{n-x} = {}_nC_x \, p^x q^{n-x}          III.7.2.

where  x  may  assume  the  values  0,  1,  2,  . . .,  n.  This  is  a  funda- 
mental law  in  probability,  and  if  we  let  x  take  on  all  integral 
values  from  0  to  n,  we  obtain  the  respective  probability  for  each 
of  the  possible  and  mutually  exclusive  events. 
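As a minimal sketch of III.7.2., the function below evaluates P (x) for each x from 0 to n. The name binomial_pmf and the choice of three coin tosses (p = 1/2) are assumptions made for illustration and are not part of the text.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = nCx * p^x * q^(n-x), the generating term of III.7.2."""
    q = 1.0 - p
    return comb(n, x) * p**x * q**(n - x)

# Illustrative case: a coin tossed n = 3 times, p = q = 1/2.
probabilities = [binomial_pmf(x, 3, 0.5) for x in range(4)]
print(probabilities)        # [0.125, 0.375, 0.375, 0.125]
print(sum(probabilities))   # 1.0
```

Because the events x = 0, 1, ..., n are mutually exclusive and exhaust the possibilities, the probabilities add to 1.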

A  more  general  theorem  in  which  combinations  are  involved  is 
known  as  the  Binomial  Theorem. 

III.  8.  The  Binomial  Theorem  (applied  to  probability).  The  Binomial 
Theorem  states  that  if  the  probability  that  an  action  will  take 
place  in  a  particular  way  is  p,  and  the  probability  that  it  will  not 
be  so  performed  is  q,  then  the  probability  that  it  will  take  place 
in  exactly  n,  (n  —  1),  (n  —  2),  ...  3,  2,  1,  0  out  of  n  trials  is  given 
by  the  successive  terms  of  the  binomial  expansion : 

    (p + q)^n = p^n + n p^{n-1} q + \frac{n(n-1)}{2!} p^{n-2} q^2 + \dots + q^n          III.8.1.

which  is  known  as  the  Binomial  Distribution. 

It will be noted that the generating term is of the form nCx
p^x q^(n-x). For the purpose of illustration, let a coin be tossed
3 times. In this case p = q = 1/2. The probabilities of getting 0, 1,
2, or 3 heads are:

    (1/2)^3,  3 (1/2)^3,  3 (1/2)^3,  (1/2)^3

and these are the successive terms of

    (p + q)^3 = p^3 + 3 p^2 q + 3 p q^2 + q^3

Similarly the probabilities of getting 0, 1, 2, 3, or 4 heads are:

    (1/2)^4,  4 (1/2)^4,  6 (1/2)^4,  4 (1/2)^4,  (1/2)^4

We might represent the possible results of tossing a penny four
times graphically, as shown in Figure III.1.


[Figure III.1. Graphical Representation of the Possible Results of
Tossing a Penny. Horizontal axis: Number of Trials. Ordinates are
marked in sixteenths.]

The probability of each number of heads is represented on the
vertical ordinate. The width of each rectangle is equal to one unit
= Δx. The area of each rectangle expressed in general terms is

    {}_nC_x \, p^x q^{n-x} \, \Delta x = {}_nC_x \, p^x q^{n-x}

This means that the area of each rectangle equals the probability
of getting the number of heads corresponding with the mid-point
of its base. The entire area = the probability of getting 0, 1, 2, 3,
or 4 heads = 1/16 + 4/16 + 6/16 + 4/16 + 1/16 = 1, so that the
probability of getting a given number of heads is equal to

    (Area of rectangle) / (Area of whole figure)

Expressed mathematically, the probability of getting any
number of heads, x, is

    P(x) = \frac{{}_nC_x \, p^x q^{n-x}}{\sum {}_nC_x \, p^x q^{n-x}} = {}_nC_x \, p^x q^{n-x}          III.8.2.

since \sum {}_nC_x \, p^x q^{n-x} = 1.

In the example given p = q = 1/2 with the result that the graph
of  the  distribution  is  symmetrical.  If  p  is  not  equal  to  q  the  distri- 
bution is  not  symmetrical  but  skewed.  It  is  also  clear  that  as  n  is 
increased,  the  area  can  be  accurately  represented  by  a  smooth 
curve.  It  is  only  in  the  long  run  that  the  relative  frequency  with 
which  an  event  happens  as  specified  may  be  compared  to  probab- 
ility. It  is  only  when  a  man  has  large  capital  that  he  can  play  long 
enough  to  take  advantage  of  the  odds  in  his  favor. 
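The four-toss distribution can also be obtained by direct enumeration of every equally likely outcome, without the binomial formula. The sketch below is an illustration only; the function name head_count_distribution is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def head_count_distribution(n):
    """Probability of 0, 1, ..., n heads found by listing all 2^n outcomes."""
    counts = Counter(outcome.count("H") for outcome in product("HT", repeat=n))
    total = 2 ** n
    return [Fraction(counts[k], total) for k in range(n + 1)]

# Four tosses: 1/16, 4/16, 6/16, 4/16, 1/16 (printed in lowest terms).
print(head_count_distribution(4))
```

The counts 1, 4, 6, 4, 1 are the binomial coefficients of (p + q)^4, and the distribution is symmetrical because p = q.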

A  quicker  and  more  efficient  way  of  obtaining  the  probabilities 
for  an  event  happening  as  specified  x  times  out  of  n  trials  is  by  the 
use  of  a  recursion  formula.  As  in  III. 8. 2.,  let 

    P(x) = \frac{n!}{x!\,(n - x)!} p^x q^{n-x}

Then,

    P(x + 1) = \frac{n!}{(x + 1)!\,(n - x - 1)!} p^{x+1} q^{n-x-1}          III.8.3.

Dividing III.8.3. by III.8.2., we get

    \frac{P(x + 1)}{P(x)} = \frac{n - x}{x + 1} \cdot \frac{p}{q}          III.8.4.

whence,

    P(x + 1) = \frac{n - x}{x + 1} \cdot \frac{p}{q} \cdot P(x)          III.8.5.

To  obtain  the  values  shown  in  the  tabular  form,  we  proceed  as 
follows :  Let  x  =  0,  then  from  III.8.2  it  is  found  that  P  (x)  =  P  (0) 
=  qn.  Next,  from  III. 8. 5.,  we  find  that  where  x  =  0, 

    P(1) = \frac{n}{1} \cdot \frac{p}{q} \cdot P(0) = \frac{n}{1} \cdot \frac{p}{q} \cdot q^n = n q^{n-1} p.

Then, let x = 1 in III.8.5., and

    P(2) = \frac{n - 1}{2} \cdot \frac{p}{q} \cdot P(1) = \frac{n - 1}{2} \cdot \frac{p}{q} \cdot n q^{n-1} p = \frac{n(n - 1)}{2!} q^{n-2} p^2

Continuing  in  this  way,  all  the  probabilities  of  happenings  may  be 
obtained  and  they  are  shown  in  the  following  table  for  the  different 
possibilities. 

Table III.1
Binomial Distribution

Number of        Probability of
Happenings       Happenings

0                q^n
1                n q^(n-1) p
2                [n(n-1) / 2!] q^(n-2) p^2
3                [n(n-1)(n-2) / 3!] q^(n-3) p^3
...              ...
x                [n! / (x! (n-x)!)] q^(n-x) p^x
...              ...
n                p^n


Such  a  description  of  happenings  is  designated  a  probability 
distribution  or  a  relative  frequency  distribution  in  the  case  of  a 
sample.  If  each  of  the  probabilities  were  multiplied  by  the  number 
of  individuals  (number  of  cases  or  number  of  trials),  we  would  have 
the  corresponding  theoretical  (absolute)  frequency  distribution. 
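The recursion III.8.5. lends itself to a short routine that builds the whole table from P (0) = q^n. This is a sketch under the assumption that 0 < p < 1, since the recursion divides by q; the function name is illustrative.

```python
from math import comb, isclose

def binomial_by_recursion(n, p):
    """Return [P(0), ..., P(n)] using P(x+1) = (n-x)/(x+1) * (p/q) * P(x)."""
    q = 1.0 - p                      # assumes 0 < p < 1, so q is not zero
    probabilities = [q ** n]         # P(0) = q^n
    for x in range(n):
        step = (n - x) / (x + 1) * (p / q)
        probabilities.append(probabilities[-1] * step)
    return probabilities

table = binomial_by_recursion(5, 0.3)
# Each entry agrees with the direct formula nCx p^x q^(n-x).
for x, value in enumerate(table):
    assert isclose(value, comb(5, x) * 0.3**x * 0.7**(5 - x))
```

Each term costs one multiplication once the previous term is known, which is why the recursion is quicker than evaluating the factorials afresh for every x.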
III. 9. Modal Term of Binomial Distribution. The Binomial distribution
is analyzed by finding the modal term, the arithmetic mean,
and the variance. To find the modal term we take the generating
term,

    P(x) = \frac{n!}{x!\,(n - x)!} p^x q^{n-x}

of the binomial distribution and find the value of x such that the
xth term will be a maximum and hence be greater than or equal
to either the (x + 1)th term or the (x - 1)th term. In other words,
the ratio of the xth to the (x + 1)th term or the (x - 1)th term is
equal to or greater than one. Thus

    \frac{P(x)}{P(x + 1)} = \frac{\frac{n!}{x!\,(n - x)!} p^x q^{n-x}}{\frac{n!}{(x + 1)!\,(n - x - 1)!} p^{x+1} q^{n-x-1}} \ge 1   and

    \frac{P(x)}{P(x - 1)} = \frac{\frac{n!}{x!\,(n - x)!} p^x q^{n-x}}{\frac{n!}{(x - 1)!\,(n - x + 1)!} p^{x-1} q^{n-x+1}} \ge 1

Simplifying these two inequalities, we find, respectively, that

    \frac{x + 1}{n - x} \cdot \frac{q}{p} \ge 1   or   x \ge pn - q   and

    \frac{n - x + 1}{x} \cdot \frac{p}{q} \ge 1   or   x \le pn + p

Now, if x is the modal or maximum value of x,

    pn - q \le x \le pn + p          III.9.1.

Thus neglecting a proper fraction, pn is the most probable or
modal value. If pn - q and pn + p are integers, then there exist
two equal terms which are larger than all the others. This is the
same as saying that if the chance of an event happening is 1/3, then
in 30 trials it is most likely to happen 10 times.

Examples: (a) What is the greatest number of times the event
will happen as specified when there are n = 11 trials and when
p = q = 1/2? From III.9.1., we find that x is either 5 or 6.

(b) If n = 12 trials and p = q = 1/2, x = 6.

(c) If n = 15 trials and p = 1/6 and q = 5/6, x = 2.

(d) If n = 18 trials and p = 1/6 and q = 5/6, x = 3.

(e) If n = 23 trials and p = 1/6 and q = 5/6, x = 3 or 4.

III. 10. Arithmetic Mean of Binomial Distribution. Let x̄ be the
arithmetic mean (mathematical expectation, that is, the probable or
expected number of times the event will happen as specified in n trials
under the law of repeated trials). By definition, the arithmetic mean
x̄ of x is

    \bar{x} = \frac{\sum_{x=0}^{n} x \, \frac{n!}{x!\,(n - x)!} p^x q^{n-x}}{\sum_{x=0}^{n} \frac{n!}{x!\,(n - x)!} p^x q^{n-x}}          III.10.1.


But  the  denominator  is  the  total  probability  which  is  equal  to  1. 
Simplifying, 

    \bar{x} = 0 \cdot q^n + 1 \cdot n q^{n-1} p + 2 \cdot \frac{n(n - 1)}{2!} q^{n-2} p^2 + \dots

            = np \left[ q^{n-1} + (n - 1) q^{n-2} p + \frac{(n - 1)(n - 2)}{2!} q^{n-3} p^2 + \dots \right]

            = np (q + p)^{n-1} = np (1) = np.          III.10.2.

Illustrative Example 1: Given p = 1/6, q = 5/6, and n = 18, it is
required to find the mean x̄.
Substituting in III.10.2.,

    \bar{x} = 18 \cdot \frac{1}{6} = 3.

The answer may be interpreted to mean that in the long run the
event will happen one time in 6 trials and therefore in 18 trials we
would expect the number of occurrences to be 3, while the actual
number of occurrences in a single set of 18 trials may be
x = 0, 1, 2, 3, ..., 18.

Illustrative Example 2: Suppose that it has been ascertained from
a traffic count that on the average 30 per cent of the vehicles turn
left. What is the probability that (a) a specific 3 out of 5 (say the
first 3) vehicles will turn left, (b) any three (exactly 3) out of 5
vehicles will turn left?

(a) In the first case, III.7.1., p (x) = p^x q^(n-x) becomes

    p(3) = (.3)^3 (.7)^2 = .01323          III.10.3.

(b) In the second case, III.7.2.,

    P(x) = \frac{n!}{x!\,(n - x)!} p^x q^{n-x}

becomes

    P(3) = \frac{5!}{3!\,2!} (.3)^3 (.7)^2 = .1323          III.10.4.

The  answer  found  in  III.  10.3.  means  that  in  the  long  run,  1323  times 
out  of  100,000,  a  specific  3  (say  the  first  3)  out  of  each  group  of 
5  vehicles  will  turn  left.  The  answer  found  in  III.  10.4.  means  that 
in  the  long  run,  1323  times  out  of  10,000,  any  3  out  of  each  group 
of  5  vehicles  will  turn  left. 
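The two left-turn probabilities can be reproduced exactly with rational arithmetic. This is a sketch only; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

p = Fraction(3, 10)          # probability that a vehicle turns left
q = 1 - p                    # probability that it does not
n, x = 5, 3

specific_three = p**x * q**(n - x)               # III.7.1.: a named set of 3 vehicles
any_three = comb(n, x) * specific_three          # III.7.2.: any 3 of the 5

print(specific_three, float(specific_three))     # 1323/100000 0.01323
print(any_three, float(any_three))               # 1323/10000 0.1323
```

The second result is ten times the first because there are 5C3 = 10 different sets of three vehicles among five.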

III.  11.  Variance  of  Binomial  Distribution.  Another  important 
measure  is  the  arithmetic  mean  of  the  squares  of  the  differences 
between  the  number  of  times  the  event  will  happen  as  specified 
and  the  expected  number  of  times  the  event  will  happen  as  speci- 
fied. Recall  that  in  Chapter  II  in  discussing  frequency  diagrams 
we  spoke  of  this  as  being  similar  to  the  square  of  the  radius  of 
gyration. This quantity is called the variance. To obtain its value,
if σ² is the symbol for variance, then

    E (x - np)^2 = \sigma^2 = \sum_{x=0}^{n} \frac{n!}{x!\,(n - x)!} p^x q^{n-x} (x - np)^2          III.11.1.

But

    E (x - np)^2 = E (x^2) - [E (x)]^2          III.11.2.

Since we have already found the value of E (x) to be np, it
suffices to obtain the value of E (x²). By the definition of expected
value,

    E (x^2) = \sum_{x=0}^{n} x^2 \, \frac{n!}{x!\,(n - x)!} p^x q^{n-x}

            = 0 \cdot q^n + 1 \cdot n q^{n-1} p + 4 \cdot \frac{n(n - 1)}{2!} q^{n-2} p^2 + 9 \cdot \frac{n(n - 1)(n - 2)}{3!} q^{n-3} p^3 + \dots

            = np \left[ q^{n-1} + 2 (n - 1) q^{n-2} p + \frac{3 (n - 1)(n - 2)}{2!} q^{n-3} p^2 + \dots \right]

            = np \left\{ (q + p)^{n-1} + (n - 1) p \left[ q^{n-2} + (n - 2) q^{n-3} p + \frac{(n - 2)(n - 3)}{2!} q^{n-4} p^2 + \dots \right] \right\}

            = np \left[ 1 + (n - 1)(p)(q + p)^{n-2} \right]

            = np \left[ 1 + (n - 1) p \right] = np + n^2 p^2 - np^2          III.11.3.

Substituting the values from III.11.3. and III.10.2. in III.11.2.,
we find

    \sigma^2 = E (x - np)^2 = E (x^2) - [E (x)]^2    becomes

    \sigma^2 = np + n^2 p^2 - np^2 - n^2 p^2

             = np - np^2 = np (1 - p) = npq          III.11.4.

Illustrative example: Given p = 1/6, q = 5/6, and n = 18. From
III.11.4. we find that σ² = 18 (1/6) (5/6) = 2.5. This means that in
18 trials the square of the difference between the number of
occurrences and the expected number 3 averages 2.5; the corresponding
standard deviation is σ = √2.5, or about 1.58. In other words, the
actual number of occurrences will typically lie within about 1.6 of 3,
namely, roughly between 1 and 5.
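The results np and npq can be verified by summing directly over the distribution, as the definitions III.10.1. and III.11.1. prescribe. The following sketch uses exact fractions; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_mean_variance(n, p):
    """Mean and variance found by summation over P(x), x = 0, ..., n."""
    p = Fraction(p)
    q = 1 - p
    probabilities = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]
    mean = sum(x * P for x, P in enumerate(probabilities))
    variance = sum((x - mean) ** 2 * P for x, P in enumerate(probabilities))
    return mean, variance

mean, variance = binomial_mean_variance(18, Fraction(1, 6))
print(mean, variance)   # 3 5/2
```

The summed values agree with np = 3 and npq = 2.5 for n = 18 and p = 1/6.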

In the case of relative frequency or relative number of
occurrences, if (x/n - p) is the difference between the observed
relative number of occurrences out of n and the probability p of
occurrence, then it is not hard to show that

    E (x/n - p)^2 = \frac{E (x - np)^2}{n^2} = \frac{npq}{n^2} = \frac{pq}{n}          III.11.5.

III. 12. Size of Sample Required for Stability. At this point it should
be noted that we are thinking of the relative frequencies in many
random samples, and that we are concerned about the degree of
stability or the degree of dispersion of such a series of relative
frequencies. This is a fundamental problem in statistics. In the
binomial distribution, sometimes called the Bernoulli distribution,
we assume that the underlying probability remains constant from
trial to trial and from sample to sample and that the drawings are
mutually independent. This assumption is implied in so-called
simple sampling.

Returning to Bernoulli's theorem, III.3.1., let

    \varepsilon = \lambda \sqrt{\frac{pq}{n}},   (\lambda > 1)

In the Bienayme-Tchebycheff inequality, III.5.2., let D = \sqrt{pq/n}.
Then

    P (\lambda D) \le \frac{1}{\lambda^2}   becomes   P (\varepsilon) \le \frac{pq}{n \varepsilon^2}          III.12.1.

It may be seen from III.12.1. that as n tends to infinity, η = P (ε)
tends toward zero. This proves Bernoulli's theorem for any
distribution law of probability by the use of the Bienayme-Tchebycheff
criterion as was suggested in III.5.

In order to get a comparison of the results obtained by articles
III.3., III.4., III.5., let ε = 0.01, p = 0.1, q = 0.9, λ = 2√5
= 4.472 and η = 0.05. Substituting these values in III.12.1.,

    \eta = P (\varepsilon) \le \frac{pq}{n \varepsilon^2}

so that P (.01) ≤ 0.05 is assured when

    \frac{(.1)(.9)}{n (.01)^2} \le 0.05,

whence n ≥ 18,000.

Again let ε = 0.05, p = 0.1, q = 0.9, λ = 2√5 = 4.472 and
η = 0.05. Substituting these values in III.12.1., we find that
P (.05) ≤ 0.05 is assured when

    \frac{(.1)(.9)}{n (.05)^2} \le 0.05,

whence n ≥ 720.

Comparing these results with those previously found, it is seen
that they are materially less as was indicated previously. It is
noted that n is a maximum when p = q = 1/2 for then pq is the
maximum. Hence, it is always safe to take the value of n when p
and q equal 1/2 as the minimum value of n. That is, in case the
values of p and q are not known, it is safe to use p = q = 1/2 in
determining the size of sample required. In many traffic problems,
p is very small and q very near unity which will require a smaller
sample for stability than if p were equal or nearly equal to q.
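The sample size implied by III.12.1. is the smallest n for which pq/(nε²) does not exceed η, that is, n ≥ pq/(ηε²). The sketch below evaluates this bound with exact fractions; the function name and the default p = 1/2 (the safe choice when p is unknown) are assumptions for illustration.

```python
from fractions import Fraction
from math import ceil

def chebyshev_sample_size(eps, eta, p=Fraction(1, 2)):
    """Smallest integer n with p*q / (n * eps^2) <= eta (from III.12.1.)."""
    eps, eta, p = Fraction(eps), Fraction(eta), Fraction(p)
    q = 1 - p
    return ceil(p * q / (eta * eps**2))

# Decimal strings keep the inputs exact.
print(chebyshev_sample_size("0.01", "0.01"))          # 250000
print(chebyshev_sample_size("0.01", "0.05"))          # 50000
print(chebyshev_sample_size("0.05", "0.05"))          # 2000
print(chebyshev_sample_size("0.01", "0.05", "0.1"))   # 18000
```

The first three values use p = q = 1/2 and reproduce the figures quoted with the examples of article III.5.; the last uses p = 0.1 and shows how much smaller the sample may be when p is known to be small.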

Additional means of characterizing the binomial distribution
are moments about the mean. These are:

    \mu_1 = 0

    \mu_2 = npq

    \mu_3 = npq (q - p)

    \mu_4 = 3 p^2 q^2 n^2 + pqn (1 - 6 pq)          III.12.2.

    \mu_x = \sum_{j=0}^{n} (j - np)^x \binom{n}{j} q^{n-j} p^j

    \mu_{x+1} = pq \left[ n x \mu_{x-1} + \frac{d\mu_x}{dp} \right]

where \binom{n}{j} is the number of combinations of n things taken j at a
time and n is very large.

Other characterizing means are the β coefficients:

    \beta_1 = \frac{(q - p)^2}{npq}

    \beta_2 = 3 + \frac{1 - 6 pq}{npq}          III.12.3.

β1 is a coefficient of skewness, while β2 is a coefficient of kurtosis
or "peakedness".
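The closed forms for the moments and the β coefficients can be checked against direct summation over the distribution. This is a sketch with hypothetical names; it takes β1 as μ3²/μ2³ and β2 as μ4/μ2², the usual definitions, and uses n = 10 and p = 3/10 purely as an illustration.

```python
from fractions import Fraction
from math import comb

def central_moment(n, p, k):
    """k-th moment about the mean np, summed over j = 0, ..., n."""
    p = Fraction(p)
    q = 1 - p
    return sum((j - n * p) ** k * comb(n, j) * q ** (n - j) * p**j
               for j in range(n + 1))

n, p = 10, Fraction(3, 10)
q = 1 - p
mu2, mu3, mu4 = (central_moment(n, p, k) for k in (2, 3, 4))

assert mu2 == n * p * q
assert mu3 == n * p * q * (q - p)
assert mu4 == 3 * p**2 * q**2 * n**2 + p * q * n * (1 - 6 * p * q)
assert mu3**2 / mu2**3 == (q - p) ** 2 / (n * p * q)          # beta_1
assert mu4 / mu2**2 == 3 + (1 - 6 * p * q) / (n * p * q)      # beta_2
```

As n increases, β1 tends to 0 and β2 tends to 3, the values belonging to a symmetrical bell-shaped distribution.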

The  theorems  of  Bernoulli  and  Cantelli  and  the  Bienayme- 
Tchebycheff  criterion  are  devoted  to  obtaining  a  lower  limit  to  the 
probability  that  the  experimental  error  will  not  exceed  a  given 
amount. 

The binomial distribution and particularly its generating function
P (x) given in III.7.2. gives the actual probability of the
event's occurring exactly x times in n trials, so that it is possible
to determine the actual probability of the event's occurring between
any two specified numbers of times in n trials. This is
accomplished by adding the respective separate probabilities
involved since the events are mutually exclusive.
The function P (x) is given by

    P(x) = \frac{n!}{x!\,(n - x)!} p^x q^{n-x}

The function P (x) is a fundamental law of probability for all
positive values of x, integral or fractional. The function is
continuous almost everywhere (i.e. except for negative integers) and
has a unique value for every positive value of x. It is simple enough
to handle if x is an integer. It is quite difficult and cumbersome if
x is not a positive integer.

In  practice  it  is  most  usable  when  x  is  a  whole  number.  Many 
times,  however,  x  is  not  a  whole  number.  It  then  becomes  im- 
perative, if  possible,  to  derive  from  the  function  given  in  III. 7. 2. 
another  continuous  function  which  is  easier  to  use  and  also  gives 
us  the  actual  probabilities  (not  lower  limits  only)  that  are  desired 
to  be  known. 

Two  such  functions  are  the  Normal  Distribution  and  the  Poisson 
Distribution.  We  shall  now  develop  and  discuss  these  two  func- 
tions. 

III.  13.  The  Normal  Distribution.  The  normal  distribution  is  a  con- 
tinuous approximation  to  the  binomial  distribution  when  n  is  large 
and  p  and  q  are  not  small. 

Let  us  reexamine  the  generating  term  P  (x)  of  the  binomial  dis- 
tribution, namely, 

    P(x) = \frac{n!}{x!\,(n - x)!} p^x q^{n-x}          III.13.1.

The graph of this equation is a set of points whose abscissas are x
values and ordinates are the corresponding P (x) values for all
values of x from zero to plus infinity. The function P (x) is
continuous almost everywhere (i.e., except for negative integers).

For our purpose, it is convenient to translate the origin to the
mean or expected value of x. This requires that we substitute
x = x' + np for x in III.13.1. It then becomes

    P(x') = \frac{n!}{(x' + np)!\,(nq - x')!} p^{pn + x'} q^{qn - x'}          III.13.2.

If we consider unit intervals only, the probability that the
number of occurrences will lie between np - k and np + k,
inclusive of end values, is

    \sum_{x' = -k}^{k} P(x') = P(-k) + P(-k + 1) + \dots + P(0) + P(1) + \dots + P(k)          III.13.3.

This  follows  from  the  fact  that  the  resultant  event  is  obtained  by 
compounding  a  set  of  mutually  exclusive  events  in  which  case  the 
resultant  probability  is  the  sum  of  the  probabilities  of  the  set  of 
mutually  exclusive  events. 

To simplify III.13.2., if the number of trials n is large, it is
convenient to use Stirling's asymptotic approximation for n!, which is

n! = n^{n} e^{-n} (2\pi n)^{1/2} \left(1 + \frac{1}{12n} + \frac{1}{288n^{2}} + \dots\right)    III.13.4.

or

n! = \sqrt{2\pi}\, e^{-n}\, n^{n + 1/2}    III.13.5.

if  the  first  term  of  III.13.4.  only  is  used.  If  III. 13.5.  is  used,  the 
result  obtained  is  equal  to  the  true  value  divided  by  a  number 
having  a  value  between  1  and  ^. 
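
As a rough numerical illustration of III.13.5. (a sketch added here, not part of the original derivation; the function names are hypothetical), the following Python fragment divides the true factorial by the first-term approximation. It uses only the standard library and shows the ratio approaching 1 as n grows.

```python
import math


def stirling(n: int) -> float:
    """First-term Stirling approximation: sqrt(2*pi) * e**(-n) * n**(n + 1/2)."""
    return math.sqrt(2.0 * math.pi) * math.exp(-n) * n ** (n + 0.5)


def stirling_ratio(n: int) -> float:
    """True value of n! divided by the first-term approximation."""
    return math.factorial(n) / stirling(n)


if __name__ == "__main__":
    for n in (1, 5, 10, 50):
        # The ratio stays slightly above 1 and shrinks toward 1 as n increases.
        print(n, stirling_ratio(n))
```

The ratio is always a little greater than 1, which agrees with the statement that the approximation understates the true value by a small factor.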

Remembering that n is large and using III.13.5. for all the
factorials in III.13.2.,

P(x') = \frac{1}{(2\pi npq)^{1/2}} \left(1 + \frac{x'}{pn}\right)^{-pn - x' - 1/2} \left(1 - \frac{x'}{qn}\right)^{-qn + x' - 1/2}    III.13.6.

Transforming III.13.6. by taking logarithms of both sides of the
equality,

\log_e P(x') = -\log_e (2\pi npq)^{1/2} - \left(np + x' + \tfrac{1}{2}\right) \log_e\left(1 + \frac{x'}{np}\right) - \left(qn - x' + \tfrac{1}{2}\right) \log_e\left(1 - \frac{x'}{qn}\right)    III.13.7.

Expanding \log_e(1 + x'/np) and \log_e(1 - x'/nq) in power series of x',
III.13.7. becomes

\log_e\{P(x')\,(2\pi npq)^{1/2}\} = -\left(np + x' + \tfrac{1}{2}\right)\left[\frac{x'}{np} - \frac{x'^{2}}{2n^{2}p^{2}} - \frac{x'^{3}}{n^{3}} R(x')\right] - \left(nq - x' + \tfrac{1}{2}\right)\left[-\frac{x'}{nq} - \frac{x'^{2}}{2n^{2}q^{2}} - \frac{x'^{3}}{n^{3}} S(x')\right]    III.13.8.

To make this expansion valid, it is necessary to assume that n
is sufficiently large so that x'/n is sufficiently small. It follows that
R (x') and S (x') are finite.

Simplifying III.13.8., and performing the multiplying operations
indicated, we find that

\log_e\{P(x')\,(2\pi npq)^{1/2}\} = \frac{(p - q)\,x'}{2npq} - \frac{x'^{2}}{2npq} + \frac{x'^{3}}{n^{2}}\, T(x')    III.13.9.

The equation III.13.9. may be written in the form

\log_e\{P(x')\,(2\pi npq)^{1/2}\} = -\frac{x'^{2}}{2npq} + \frac{x'}{n}\, U(x')    III.13.10.

where U (x') is also finite.

Now if n is large enough (in other words, n must be very large)
so that (x'/n) U (x') is very small (negligible or within the allowable
error), then ignoring this term, III.13.10. may be written as

P(x') = \frac{1}{(2\pi npq)^{1/2}}\, e^{-x'^{2}/(2npq)}    III.13.11.

which is called the normal distribution.

It appears that this was first known to DeMoivre in November,
1732. Multiply both sides of the equality III.13.3. by \Delta x'; then

\sum_{x'=-k}^{k} P(x')\,\Delta x' = P(-k)\,\Delta x' + P(-k+1)\,\Delta x' + \dots + P(0)\,\Delta x' + P(1)\,\Delta x' + \dots + P(k)\,\Delta x'

and on the assumption that P (x') is continuous,

\lim_{\Delta x' \to 0} \sum_{x'=-k}^{k} P(x')\,\Delta x' = \frac{1}{(2\pi npq)^{1/2}} \int_{-k}^{k} e^{-x'^{2}/(2npq)}\,dx'    III.13.12.

The right hand member of III.13.12. is known as the probability
integral. It gives the probability that a random variable x' has a
value in the range -k \le x' \le k.

If P (x') is discontinuous and the ordinates are at unit intervals,
then in III.13.3. there is one more ordinate than intervals of area.
Hence,

\sum_{x'=-k}^{k} P(x') = \frac{1}{(2\pi npq)^{1/2}} \int_{-k-1/2}^{k+1/2} e^{-x'^{2}/(2npq)}\,dx'  approximately.    III.13.13.

The  above  results  summarized  lead  to  the  well-known  DeMoivre- 
Laplace  theorem,  namely8 : 

The probability that the difference x' = x - np between the
number of occurrences x and the expected number of occurrences will
not exceed a positive number k is given to a first approximation by
III.13.12. and to closer approximation by III.13.13.
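
The two approximations named in the theorem can be compared numerically. The sketch below is an added illustration, not the authors' computation: the case n = 100, p = 1/2, k = 5 is a hypothetical choice and the function names are invented. It sums the binomial terms exactly and evaluates the normal integral with limits k and k + 1/2.

```python
import math


def binom_pmf(n: int, p: float, x: int) -> float:
    """Generating term of the binomial distribution, as in III.13.1."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def normal_cdf(z: float) -> float:
    """Area under the unit normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def exact_prob(n: int, p: float, k: int) -> float:
    """Exact P(|x - np| <= k); assumes np is a whole number."""
    centre = round(n * p)
    return sum(binom_pmf(n, p, x) for x in range(centre - k, centre + k + 1))


def normal_prob(n: int, p: float, half_width: float) -> float:
    """Normal integral between -half_width and +half_width about the mean."""
    sigma = math.sqrt(n * p * (1.0 - p))
    return 2.0 * normal_cdf(half_width / sigma) - 1.0


if __name__ == "__main__":
    n, p, k = 100, 0.5, 5  # hypothetical illustration values
    print("exact binomial sum   :", exact_prob(n, p, k))
    print("limits -k .. k       :", normal_prob(n, p, k))
    print("limits -(k+1/2) .. + :", normal_prob(n, p, k + 0.5))
```

In this case the integral taken to k + 1/2 lies much nearer the exact sum than the integral taken to k, which is the sense in which III.13.13. is the closer approximation.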


III. 14. Interpretation of the Properties of Normal Distribution. The
special form of the normal distribution as given in III.13.11. is
restricted to the conditions that n is large and p and q are not small,
thus giving a continuous approximation to the binomial distribution.

Now consider

P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/(2\sigma^{2})}    III.14.1.

where \sigma is the standard deviation with the restriction that it is
finite such that 0 < \sigma \le k.

The  graph  of  the  equation  is  shown  in  Figure  III. 2. 

From III.14.1., it is seen that the curve is symmetrical with
respect to the y-axis. Likewise the curve has a maximum point at
x = 0, namely at the point whose abscissa is the arithmetic mean.
There are two points of inflection, namely P_1 and P_2, each of which
is at a distance \sigma from the arithmetic mean. The curve is
asymptotic to the x-axis at both plus and minus infinity.

From III.14.1. or from tables, it is found that the total area
under the curve is unity, the area between x = -\sigma and x = +\sigma
is 0.6827, the area between x = -2\sigma and x = +2\sigma is 0.9545,
and the area between x = -3\sigma and x = +3\sigma is 0.9973. If

\frac{1}{\sigma\sqrt{2\pi}} \int_{-x}^{x} e^{-t^{2}/(2\sigma^{2})}\,dt = \frac{1}{2}    III.14.2.

then x = 0.67449 \sigma, which is known as the probable error.
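
The tabulated areas quoted above can be checked with the error function in the Python standard library. This is an added sketch with hypothetical function names. The bisection used to find the probable error is simply one convenient way of solving III.14.2. numerically.

```python
import math


def area_within(c: float) -> float:
    """Area under the normal curve between -c*sigma and +c*sigma."""
    return math.erf(c / math.sqrt(2.0))


def probable_error_multiplier(tol: float = 1e-10) -> float:
    """Solve area_within(c) = 1/2 for c by bisection on the interval [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if area_within(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for c in (1, 2, 3):
        print(c, "sigma:", round(area_within(c), 4))
    print("probable error / sigma:", round(probable_error_multiplier(), 5))
```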


Figure III.2. Graph of the Equation P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/(2\sigma^{2})}

As an illustration, consider again the case \eta = 0.05, \epsilon = 0.01.
From the Bienaymé-Tchebycheff inequality, \lambda = t = 4.472. Now,
let p = q = 1/2. Then, from III.11.5. and III.12.1.,


becomes 


4.472 


W  (i) 


n 


£0.01 


whence  n  ^>  500 

Similarly,  if  73  =  0.05  and  e  =  0.05 
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?*£. 


es  4.472  |/(™t)  ^  0#06 


whence  n  ^>  100. 

Again,  let  p  =  \  and  e  —  0.01.  The  value  of  t  such  that 

-—    I  e    2  dx  =  0.99  =  1  —  7] 
V*«JL 

pqt2 
is  2.58.  But  n  ^  ■ .  Hence,  solving  for  n,  it  is  found  that  n  ^  166 

z 

and  if 


2       P*  -— 

=    I  e     2  dx  =  0.95  =  1  — 


t  =  1.96  and  n  ^  97,  if   s  =  0.01. 

Under  certain  conditions  where  p  =  q,  the  equation  of  the  con- 
tinuous approximation  curve  is  given  by 

—      NPP+1      e-Wl+^a  ni.14.3. 


aePr(p  +  l)  \        a 

where  the  origin  is  at  the  mode. 

The question is often raised: How is it known that the distribution
is normal? A very good answer is: If it can be justified axiomatically
that the arithmetic mean is the most probable value, then
the distribution is normal. This is known as the postulate of the
arithmetic mean. Another way is: If \beta_1 = 0 and \beta_2 = 3 (see
II.25.17. and II.25.18.), the distribution is normal.


III. 15. Poisson Distribution. This distribution is frequently thought
of as the law of small probabilities or the law of rare events. It
appears to be especially useful in solving many traffic problems
(see Chap. V).

Consider again the generating term of the binomial expansion,

P(x) = \frac{n!}{x!\,(n-x)!}\, p^{x} q^{n-x}    III.15.1.

the probability that in n trials exactly x of them will take place as
specified, where p is the probability that the event in a single trial
will occur as specified.

Equation III.15.1. may be written as

P(x) = \frac{n(n-1)(n-2)\cdots(n-x+1)}{x!}\, p^{x} (1-p)^{n-x}    III.15.2.


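Write p = m/n, where m is the number of times a given happening
occurs in n trials. Substituting this value of p for p in III.15.2.,

P(x) = \frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-x+1}{n}\cdot\frac{m^{x}}{x!}\left(1 - \frac{m}{n}\right)^{n}\left(1 - \frac{m}{n}\right)^{-x}    III.15.3.

Now, hold both x and m fixed and let n approach infinity. Then,
in the limit,

\frac{n}{n} = 1,\quad \frac{n-1}{n} = 1,\quad \dots,\quad \frac{n-x+1}{n} = 1,\quad \text{and}\quad \left(1 - \frac{m}{n}\right)^{-x} = 1    III.15.4.

To obtain the limiting value of (1 - m/n)^n we set

\left(1 - \frac{m}{n}\right)^{n} = \left[\left(1 - \frac{m}{n}\right)^{n/m}\right]^{m}

The limiting value of (1 - m/n)^{n/m} as n approaches
infinity is e^{-1}. Hence

\lim_{n \to \infty} \left(1 - \frac{m}{n}\right)^{n} = e^{-m}    III.15.5.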


Substituting all the limiting values just found in III.15.3., we
obtain

P(x) = (1)(1)(1)\cdots(1)\,\frac{m^{x}}{x!}\,e^{-m}\,(1)    III.15.6.

which may be written as

P(x) = \frac{m^{x} e^{-m}}{x!}    III.15.7.

which is Poisson's distribution or the Poisson Exponential Function.
This function is a continuous approximation to the binomial
distribution when p is small and n is large.

The function is continuous almost everywhere and has a real
value for all values of x except negative integers. For negative
integral values of x, P (x) is not defined. The continuity is obvious
if it is recalled that x! is related to the Gamma Function,9 that is:

x! = \int_{0}^{\infty} y^{x} e^{-y}\,dy = \Gamma(x + 1)    III.15.8.

The graph of the function is shown in Figure III.3. Also tables
(Tables for Biometricians and Statisticians, pp. 122-124) of values
for P_x exist.


[Figure III.3. Graph of the Poisson distribution for m = 0.5, 1.0, 2.0, and 5.0, plotted against values of x.]

From the figure it is seen that for small values of m the curve is
highly skewed and that as the values of m increase the curve becomes
more symmetrical.

In  all  cases,  p  must  be  small  and  n  must  be  large,  but  small 
values  of  m  as  well  as  large  values  of  m  are  possible  under  these 
conditions.  It  is  also  quite  important  to  note  that  as  m  becomes 
larger,  the  agreement  between  III.  15.7.  and  III.  13.11.  becomes 
closer. 
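
The limiting process that leads to III.15.7. can be watched numerically. The following sketch is an added illustration with hypothetical names. It holds m fixed at 6, lets n take increasing values, sets p = m/n, and reports the largest difference between the binomial and the Poisson terms.

```python
import math


def binom_pmf(n: int, p: float, x: int) -> float:
    """Binomial generating term, as in III.15.1."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def poisson_pmf(m: float, x: int) -> float:
    """Poisson term m**x * e**(-m) / x!, as in III.15.7."""
    return m ** x * math.exp(-m) / math.factorial(x)


def max_abs_diff(n: int, m: float, x_max: int = 20) -> float:
    """Largest |binomial - Poisson| over x = 0..x_max with p = m/n (needs x_max <= n)."""
    p = m / n
    return max(abs(binom_pmf(n, p, x) - poisson_pmf(m, x))
               for x in range(x_max + 1))


if __name__ == "__main__":
    for n in (20, 100, 10_000):
        print(n, max_abs_diff(n, 6.0))
```

The differences shrink as n grows with m held fixed, in line with the requirement that p be small and n large.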

III.  16.  The  Sum  of  the  Terms  of  the  Poisson  Distribution.  Since  each 
term  is  the  probability  for  the  event's  happening  x  times,  the  sum  of 
the  probabilities  for  each  of  these  possibilities  should  equal  unity 
because  some  one  of  the  possibilities  is  certain  to  take  place.  Letting 
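x take successively the values 0, 1, 2, ...., the sum of the
respective terms is

\sum_{x=0}^{\infty} \frac{m^{x} e^{-m}}{x!} = \frac{m^{0} e^{-m}}{0!} + \frac{m e^{-m}}{1!} + \frac{m^{2} e^{-m}}{2!} + \dots
    = e^{-m}\left(1 + \frac{m}{1!} + \frac{m^{2}}{2!} + \frac{m^{3}}{3!} + \dots\right)    III.16.1.

The series in parentheses has the value e^{m}. Hence

\sum_{x=0}^{\infty} \frac{m^{x} e^{-m}}{x!} = e^{-m} e^{m} = e^{0} = 1    III.16.2.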


III. 17. The Arithmetic Mean of Poisson Distribution. If \bar{x} is the
arithmetic mean number of happenings, then

\bar{x} = \sum_{x=0}^{\infty} x\,\frac{m^{x} e^{-m}}{x!}
    = 0\cdot\frac{m^{0} e^{-m}}{0!} + 1\cdot\frac{m e^{-m}}{1!} + 2\cdot\frac{m^{2} e^{-m}}{2!} + 3\cdot\frac{m^{3} e^{-m}}{3!} + \dots
    = m e^{-m}\left[1 + m + \frac{m^{2}}{2!} + \frac{m^{3}}{3!} + \dots\right]
    = m e^{-m} e^{m} = m.    III.17.1.

III. 18. The Variance of Poisson Distribution. Since variance is the
expected value of the squares of the measurements minus the
square of the expected value of the measurements, we will first
obtain the expected value of the squares of the measurements. It
is given by

E(x^{2}) = \sum_{x=0}^{\infty} x^{2}\,\frac{m^{x} e^{-m}}{x!}
    = 0\cdot\frac{m^{0} e^{-m}}{0!} + 1\cdot\frac{m e^{-m}}{1!} + 4\cdot\frac{m^{2} e^{-m}}{2!} + 9\cdot\frac{m^{3} e^{-m}}{3!} + \dots
    = m e^{-m}\left[1 + \frac{2m}{1!} + \frac{3m^{2}}{2!} + \dots\right]
    = m e^{-m}\left[e^{m} + m\left(1 + \frac{m}{1!} + \frac{m^{2}}{2!} + \dots\right)\right]
    = m e^{-m}\left[e^{m} + m e^{m}\right] = m + m^{2}    III.18.1.

But the square of the expected value is m^{2}. Hence

\sigma^{2} = E(x^{2}) - [E(x)]^{2}
    = m + m^{2} - m^{2} = m    III.18.2.
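
The three results just obtained (unit total, mean m, and variance m) can be confirmed by summing the series directly. The sketch below is an added check with a hypothetical function name. It truncates the series at a fixed number of terms, which is an assumption that holds well for moderate m.

```python
import math


def poisson_moments(m: float, terms: int = 200):
    """Return (sum of terms, mean, variance) from the first `terms` Poisson terms."""
    total = mean = second = 0.0
    term = math.exp(-m)  # P(0) = e**(-m)
    for x in range(terms):
        total += term
        mean += x * term
        second += x * x * term
        term *= m / (x + 1)  # P(x+1) = P(x) * m / (x + 1)
    return total, mean, second - mean ** 2


if __name__ == "__main__":
    print(poisson_moments(6.0))
```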

Example 1. There occurred at a certain highway intersection 6
accidents during the passing of 10,000 vehicles. In this case p =
0.0006 and n = 10,000. Suppose we wish to know the probability
that the number of accidents lies between 3 and 9 per 10,000
vehicles. Making use of III.13.13., with np = 6, 2npq = 11.9928,
and k = 3, we find that

P = \frac{1}{(11.9928\,\pi)^{1/2}} \int_{-3.5}^{+3.5} e^{-x'^{2}/11.9928}\,dx' = (0.02654)^{1/2} \int_{-3.5}^{+3.5} e^{-x'^{2}/11.9928}\,dx'

From tables of the normal probability function it is found that if

z = \frac{x'}{\sigma} = \frac{3.5}{2.449} = 1.429

then

(0.02654)^{1/2} \int_{-3.5}^{+3.5} e^{-x'^{2}/11.9928}\,dx'

read from the tables as the area between -z and +z, is
the desired probability.

To  calculate  the  probability  from  the  Poisson  distribution  with 
m  =  6,  we  add  the  probabilities  for  the  event's  happening  3,  4,  5, 
6,  7,  8,  and  9  times  as  taken  from  the  Poisson  tables10  for  indi- 
vidual terms : 


Happenings           Probability
3                    .089235
4                    .133853
5                    .160623
6                    .160623
7                    .137677
8                    .103258
9                    .068838
Total Probability    .854107

We may also use the table for cumulated terms and subtract
the probability for 10 or more happenings from the probability for
3 or more happenings with m = 6.

Happenings        Probability
3 or more         .938031
10 or more        .083924
Difference        .854107 = probability of 3 to 9.

Again  if  the  binomial  distribution  is  used,  the  value  of  the  de- 
sired probability  is  0.854. 

These  results  show  that  there  is  little  difference  between  the  use 
of  the  so-called  normal  distribution  and  the  Poisson  exponential 
function,  while  the  Poisson  exponential  function  is  a  better 
approximation  than  the  Bernoulli  distribution  for  rare  events, 
that  is  events  with  small  probability. 
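
The three routes used in Example 1 can be reproduced directly. The sketch below is an added illustration, not the authors' table work, and its function names are hypothetical. It evaluates the Poisson sum, the exact binomial sum, and a normal integral taken to half a unit beyond the end values, which is the assumption borrowed from III.13.13.

```python
import math

N_VEHICLES = 10_000   # n in Example 1
P_ACCIDENT = 0.0006   # p in Example 1


def poisson_range(m: float, lo: int, hi: int) -> float:
    """Sum of Poisson terms for x = lo..hi."""
    return sum(m ** x * math.exp(-m) / math.factorial(x)
               for x in range(lo, hi + 1))


def binomial_range(n: int, p: float, lo: int, hi: int) -> float:
    """Sum of binomial terms for x = lo..hi."""
    return sum(math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)
               for x in range(lo, hi + 1))


def normal_range(n: int, p: float, lo: int, hi: int) -> float:
    """Normal area from lo - 1/2 to hi + 1/2 with mean np and variance npq."""
    mean = n * p
    sigma = math.sqrt(n * p * (1.0 - p))

    def cdf(v: float) -> float:
        return 0.5 * (1.0 + math.erf((v - mean) / (sigma * math.sqrt(2.0))))

    return cdf(hi + 0.5) - cdf(lo - 0.5)


if __name__ == "__main__":
    m = N_VEHICLES * P_ACCIDENT
    print("Poisson :", poisson_range(m, 3, 9))
    print("Binomial:", binomial_range(N_VEHICLES, P_ACCIDENT, 3, 9))
    print("Normal  :", normal_range(N_VEHICLES, P_ACCIDENT, 3, 9))
```

The Poisson and binomial sums agree closely with each other, while the normal figure falls slightly below them for this small value of m.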

Example 2. For a given period of time, at a certain point on a
highway, it is observed that on the average three heavy trucks per
100 vehicles pass the point. A subsequent sample contains six
heavy trucks per 100 vehicles. Using the Poisson exponential
distribution, compute the probabilities of 0, 1, 2, 3, 4, 5, 6, 7, and 8
heavy trucks per 100 vehicles using m = np = 3.
The probability distribution is shown in Table III.2.

Table III.2.

x      P_x          x      P_x
0      .0498        5      .1008
1      .1494        6      .0504
2      .2240        7      .0216
3      .2240        8      .0081
4      .1680

This table shows that (1) the probability of obtaining one heavy
truck in a sample of 100 vehicles is 0.1494; (2) the probability of
getting at least three heavy trucks is .5768; (3) the probability
of getting at least six heavy trucks is .0840.

The probability of six or less than six, being .9664 with a level
of significance of 1 - .9664 = .0336, indicates that on a 5 per cent
level we have grounds to reject the hypothesis that this number
of heavy trucks is not significant.
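
The entries of Table III.2. and the cumulative figures drawn from it can be regenerated as follows. This is an added sketch with hypothetical helper names, and m = 3 is the value given in the example. Small differences in the fourth decimal place arise because the table entries are themselves rounded.

```python
import math

MEAN_TRUCKS = 3.0  # m = np = 100 * 0.03, as stated in Example 2


def poisson_pmf(m: float, x: int) -> float:
    """Probability of exactly x happenings."""
    return m ** x * math.exp(-m) / math.factorial(x)


def poisson_cdf(m: float, x: int) -> float:
    """Probability of x or fewer happenings."""
    return sum(poisson_pmf(m, j) for j in range(x + 1))


def poisson_tail(m: float, x: int) -> float:
    """Probability of at least x happenings."""
    return 1.0 - poisson_cdf(m, x - 1)


if __name__ == "__main__":
    for x in range(9):
        print(x, round(poisson_pmf(MEAN_TRUCKS, x), 4))
    print("at least 3:", poisson_tail(MEAN_TRUCKS, 3))
    print("6 or fewer:", poisson_cdf(MEAN_TRUCKS, 6))
    print("at least 6:", poisson_tail(MEAN_TRUCKS, 6))
```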

In  obtaining  the  size  of  the  sample  so  that  the  error  from  the 
arithmetic  mean  is  one  heavy  truck,  namely,  that  the  number  of 
heavy  trucks  is  between  2  and  4,  the  reasoning  is : 

The  standard  deviation  is 

cr  =  m  =  np  =  n  (.03) 
and  since  e  =  1,  it  is  clear  that 

e  =  tm 
becomes  1  s=  (%)  n  (.03) 

which  gives  n  e=  100 

and  the  sum  of  the  probabilities,  namely 

.2240  +  .2240+  .1680  =  .6160,  the  measure  of  certainty. 

Example 3. Required to find the probability of n cars appearing
within an interval of time r beginning at the instant t. Then
p (n, r, t), the probability of n cars within an interval of time r
beginning at the instant t, is given by

p(n, r, t) = \frac{K^{n} e^{-K}}{n!}

where K is the expected number of cars in the interval.

III.  19.  Dispersion  and  Variance.  Thus  far  it  has  been  assumed 
that  the  relative  frequency  (sample)  or  the  probability  (universe) 
that  an  event  will  happen  as  specified  remains  constant  through- 
out the  entire  field  of  observation.  There  are  many  cases  where 
the  underlying  probability  (relative  frequency)  does  not  remain 
constant.  This  indicates  that  it  is  necessary  that  the  statistician 
obtain  all  the  available  knowledge  from  the  data  by  properly 
classifying  them  into  subsets  for  analysis  and  comparison.  In  other 
words,  it  is  valuable  to  know  whether  the  relative  frequencies  or 
probabilities  vary  from  case  to  case  or  from  set  to  set. 

Consider the following: Given N independent quantities X_1, X_2,
..., X_N such that the mean or expected value E (X_i) of X_i is a_i
and the mean or expected value E (X_i^2) of X_i^2 is A_i. Then, if

\bar{X} = \frac{X_1 + X_2 + \dots + X_N}{N} \quad\text{and}\quad a = \frac{a_1 + a_2 + \dots + a_N}{N},

it has been shown ("Probability," by J. L. Coolidge, Oxford Press,
1925, p. 67) that

E\left[\sum_{i=1}^{N} (X_i - \bar{X})^{2}\right] = \frac{N-1}{N} \sum_{i=1}^{N} (A_i - a_i^{2}) + \sum_{i=1}^{N} (a_i - a)^{2}    III.19.1.

If the observations are from homogeneous data, a_i = a, A_i = A.
In such a case, III.19.1. reduces to

E\left[\sum_{i=1}^{N} (X_i - \bar{X})^{2}\right] = \frac{N-1}{N}\cdot N(A - a^{2}) = (N - 1)\,\sigma^{2}    III.19.2.

since

\sigma^{2} = E(X^{2}) - [E(X)]^{2} = A - a^{2}.

The relationship given in III.19.2. reduces to

\sigma^{2} = E\left[\sum_{i=1}^{N} (X_i - \bar{X})^{2} / (N - 1)\right]    III.19.3.
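
The statement in III.19.3. can be verified exactly for a small case by enumerating every possible sample. The sketch below is an added illustration. The three values, their probabilities, the sample size N = 3, and the function names are all hypothetical. Exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product


def expected_sample_variance(values, probs, n):
    """Exact E[sum((X_i - Xbar)**2) / (n - 1)] over all n-tuples of independent draws.

    `values` must be integers and `probs` Fractions summing to 1.
    """
    total = Fraction(0)
    for combo in product(range(len(values)), repeat=n):
        weight = Fraction(1)
        for idx in combo:
            weight *= probs[idx]          # probability of this particular sample
        xs = [values[idx] for idx in combo]
        mean = Fraction(sum(xs), n)
        ss = sum((x - mean) ** 2 for x in xs)
        total += weight * ss / (n - 1)
    return total


def population_variance(values, probs):
    """A - a**2, where a = E(X) and A = E(X**2)."""
    a = sum(v * p for v, p in zip(values, probs))
    big_a = sum(v * v * p for v, p in zip(values, probs))
    return big_a - a * a


if __name__ == "__main__":
    values = [0, 1, 4]                                          # hypothetical measurements
    probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # hypothetical probabilities
    print(expected_sample_variance(values, probs, 3))
    print(population_variance(values, probs))
```

Both printed quantities coincide, which is the content of III.19.3. for homogeneous data.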


Suppose now that a set of N = lk independent items has been
observed and classified in some relevant manner, say, in l rows of
k items each as shown in Table III.3.

Table III.3.

X_11, X_12, ..., X_1j, ..., X_1k        T_1.    \bar{X}_1.
X_21, X_22, ..., X_2j, ..., X_2k        T_2.    \bar{X}_2.
  ...
X_i1, X_i2, ..., X_ij, ..., X_ik        T_i.    \bar{X}_i.
  ...
X_l1, X_l2, ..., X_lj, ..., X_lk        T_l.    \bar{X}_l.

T_.1, T_.2, ..., T_.j, ..., T_.k        T
\bar{X}_.1, \bar{X}_.2, ..., \bar{X}_.j, ..., \bar{X}_.k        \bar{X}

In the table, T_i. is the total and \bar{X}_i. is the arithmetic mean of
the ith row; T_.j is the total and \bar{X}_.j is the arithmetic mean of the
jth column; and T is the total and \bar{X} is the arithmetic mean of the
whole sample of N = lk items.

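Let E(X_{ij}) = a_{ij};  E(X_{ij}^{2}) = A_{ij};  \sum_{j=1}^{k} a_{ij} = k a_i;  \sum_{i=1}^{l} a_i = l a;

\sum_{i=1}^{l} \bar{X}_{i.} = l \bar{X};  \sum_{j=1}^{k} \bar{X}_{.j} = k \bar{X}.

Then, by III.19.1., for the ith row

E\left[\sum_{j=1}^{k} (X_{ij} - \bar{X}_{i.})^{2}\right] = \frac{k-1}{k} \sum_{j=1}^{k} (A_{ij} - a_{ij}^{2}) + \sum_{j=1}^{k} (a_{ij} - a_i)^{2}    III.19.4.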

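Summing III.19.4. for all the l rows, it is found that

E\left[\sum_{i=1}^{l} \sum_{j=1}^{k} (X_{ij} - \bar{X}_{i.})^{2}\right] = \frac{k-1}{k} \sum_{i=1}^{l} \sum_{j=1}^{k} (A_{ij} - a_{ij}^{2}) + \sum_{i=1}^{l} \sum_{j=1}^{k} (a_{ij} - a_i)^{2}    III.19.5.

Since E(\bar{X}_{i.}) = a_i, we note that

E(\bar{X}_{i.} - a_i)^{2} = E(\bar{X}_{i.}^{2}) - 2 a_i E(\bar{X}_{i.}) + a_i^{2} = E(\bar{X}_{i.}^{2}) - a_i^{2}, or

E(\bar{X}_{i.}^{2}) = E(\bar{X}_{i.} - a_i)^{2} + a_i^{2}    III.19.6.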

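Applying III.19.1. to \bar{X}_{i.} (i = 1, 2, ..., l),

E\left[\sum_{i=1}^{l} (\bar{X}_{i.} - \bar{X})^{2}\right] = \frac{l-1}{l} \sum_{i=1}^{l} \left[E(\bar{X}_{i.}^{2}) - a_i^{2}\right] + \sum_{i=1}^{l} (a_i - a)^{2}    III.19.7.

But

E(\bar{X}_{i.}^{2}) - a_i^{2} = E(\bar{X}_{i.} - a_i)^{2} = \frac{1}{k^{2}} \sum_{j=1}^{k} (A_{ij} - a_{ij}^{2})

so that

E\left[\sum_{i=1}^{l} (\bar{X}_{i.} - \bar{X})^{2}\right] = \frac{l-1}{l k^{2}} \sum_{i=1}^{l} \sum_{j=1}^{k} (A_{ij} - a_{ij}^{2}) + \sum_{i=1}^{l} (a_i - a)^{2}    III.19.8.
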

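Applying III.19.1. to the N = lk values, we get

E\left[\sum_{i=1}^{l} \sum_{j=1}^{k} (X_{ij} - \bar{X})^{2}\right] = \frac{lk-1}{lk} \sum_{i=1}^{l} \sum_{j=1}^{k} (A_{ij} - a_{ij}^{2}) + \sum_{i=1}^{l} \sum_{j=1}^{k} (a_{ij} - a)^{2}    III.19.9.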

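By starting with the jth column and proceeding as in III.19.5.,
III.19.6., and III.19.7., it is found that

E\left[\sum_{j=1}^{k} \sum_{i=1}^{l} (X_{ij} - \bar{X}_{.j})^{2}\right] = \frac{l-1}{l} \sum_{j=1}^{k} \sum_{i=1}^{l} (A_{ij} - a_{ij}^{2}) + \sum_{j=1}^{k} \sum_{i=1}^{l} (a_{ij} - b_j)^{2}    III.19.10.

and

E\left[\sum_{j=1}^{k} (\bar{X}_{.j} - \bar{X})^{2}\right] = \frac{k-1}{k l^{2}} \sum_{j=1}^{k} \sum_{i=1}^{l} (A_{ij} - a_{ij}^{2}) + \sum_{j=1}^{k} (b_j - a)^{2}    III.19.11.

where b_j is the column counterpart of a_i, that is, \sum_{i=1}^{l} a_{ij} = l b_j.

If the N = lk values are statistically homogeneous or are all
observations from the same population, then A_{ij} = A, a_{ij} = a_i
= b_j = a, so that III.19.5., III.19.8., III.19.9., III.19.10.,
and III.19.11. become, respectively,

E\left[\sum_{i} \sum_{j} (X_{ij} - \bar{X}_{i.})^{2}\right] = \frac{k-1}{k}\cdot lk\,(A - a^{2}) = l(k-1)(A - a^{2})    III.19.12.

E\left[\sum_{i} (\bar{X}_{i.} - \bar{X})^{2}\right] = \frac{l-1}{l k^{2}}\cdot lk\,(A - a^{2}) = \frac{l-1}{k}(A - a^{2})    III.19.13.

E\left[\sum_{i} \sum_{j} (X_{ij} - \bar{X})^{2}\right] = \frac{lk-1}{lk}\cdot lk\,(A - a^{2}) = (lk-1)(A - a^{2})    III.19.14.

E\left[\sum_{j} \sum_{i} (X_{ij} - \bar{X}_{.j})^{2}\right] = \frac{l-1}{l}\cdot lk\,(A - a^{2}) = k(l-1)(A - a^{2})    III.19.15.

E\left[\sum_{j} (\bar{X}_{.j} - \bar{X})^{2}\right] = \frac{k-1}{k l^{2}}\cdot lk\,(A - a^{2}) = \frac{k-1}{l}(A - a^{2})    III.19.16.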

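To summarize, it has been shown that in a statistically
homogeneous set of N = lk observations arranged in l rows and k
columns, the following estimates of variance (or the following
mean sums of squares) all have the same expected value:

(1) \frac{\sum_{i} \sum_{j} (X_{ij} - \bar{X})^{2}}{lk - 1}

(2) \frac{\sum_{i} \sum_{j} (X_{ij} - \bar{X}_{i.})^{2}}{l(k - 1)}

(3) \frac{\sum_{j} \sum_{i} (X_{ij} - \bar{X}_{.j})^{2}}{k(l - 1)}

(4) \frac{k \sum_{i} (\bar{X}_{i.} - \bar{X})^{2}}{l - 1}

(5) \frac{l \sum_{j} (\bar{X}_{.j} - \bar{X})^{2}}{k - 1}    III.19.17.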

Any significant differences between the estimates given in
III.19.17. indicate lack of homogeneity of the set of items. The
tests for this will be described in Chapter IV.

Let us now consider several special cases. Let p_{ij} be the
probability that X has the value X_{ij} and let p_i be the average
probability for the ith set; then

k p_i = \sum_{j=1}^{k} p_{ij};  l p = \sum_{i=1}^{l} p_i

and it can be shown by the use of III.19.1. that

\sum_{i} \sum_{j} (X_{ij} - \bar{X})^{2} = lkpq - \sum_{i} \sum_{j} (p_{ij} - p_i)^{2} + (k^{2} - k) \sum_{i} (p_i - p)^{2}    III.19.18.

The special cases are:

(1) Bernoulli series: p_{ij} = p_i = p. Here III.19.18. becomes

\sum_{i} \sum_{j} (X_{ij} - \bar{X})^{2} = lkpq

(2) Lexis series: p_{ij} = p_i; p_i \ne p. Here III.19.18. becomes

\sum_{i} \sum_{j} (X_{ij} - \bar{X})^{2} = lkpq + (k^{2} - k) \sum_{i} (p_i - p)^{2}.

(3) Poisson series: p_{ij} \ne p_i; p_i = p. Here III.19.18. becomes

\sum_{i} \sum_{j} (X_{ij} - \bar{X})^{2} = lkpq - \sum_{i} \sum_{j} (p_{ij} - p)^{2}
The  special  cases  expressed  verbally  are: 

(1)  Bernoulli  series:  The  underlying  probability  p  is  constant  from 
trial  to  trial  and  set  to  set  or  is  constant  throughout  the  whole 
field  of  observation  and  we  have  statistical  homogeneity. 

(2)  Lexis  series:  The  probability  is  constant  from  trial  to  trial 
within  a  set  but  varies  from  set  to  set  and  we  do  not  have  sta- 
tistical homogeneity. 

(3) Poisson series: The probability varies from trial to trial within
a set of k trials, but the several probabilities for one set of k trials
are identical to those of every other of the l sets of k trials and we do
not have homogeneity.

Illustrations of such series exist in the study of traffic on a given
route at l different crossings at k different times with a total of
N = lk observations.

III. 20. The Multinomial Distribution: Let samples of size n be
drawn from a specified universe with each sample divided into
the k classes or cells with the distribution random among these
classes or cells.

The probability, P, that there are f_01 individuals in the first cell,
f_02 in the second cell, and so forth, is

P = \pi_1^{f_{01}}\, \pi_2^{f_{02}} \cdots \pi_k^{f_{0k}}\, \frac{n!}{f_{01}!\, f_{02}! \cdots f_{0k}!}    III.20.1.

where \pi_1 is the probability that an individual falls in the first class
or cell, \pi_2 the probability that it falls in the second cell, and so
forth; and

\frac{n!}{f_{01}!\, f_{02}! \cdots f_{0k}!}

is the number of combinations of n things taken f_01 of one kind,
f_02 of another kind, ..., f_0k of the k-th kind.

To illustrate: At an intersection point it has been determined
that the probability of turning left is 2/5, of going straight ahead is
1/2, and of turning right is 1/10. Of 6 vehicles, what is the probability
that one will turn left, two will go straight ahead, and 3 will turn
right?

Solution: Here \pi_1 = 2/5 = 0.4, \pi_2 = 1/2 = 0.5, and \pi_3 = 1/10 = 0.1.
Also f_01 = 1, f_02 = 2, f_03 = 3. Substituting these values in III.20.1.,

P = (0.4)^{1} (0.5)^{2} (0.1)^{3}\, \frac{6!}{1!\,2!\,3!} = 0.0001\,(60) = 0.006

which means that 6 times in 1000 the event will happen as specified.
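
The substitution into III.20.1. is easy to mechanize. The following sketch is an added illustration with a hypothetical function name. It evaluates the multinomial probability for any set of cell counts and cell probabilities and is applied here to the turning-movement example above.

```python
import math


def multinomial_prob(counts, probs):
    """P = n! / (f_1! ... f_k!) * pi_1**f_1 * ... * pi_k**f_k, as in III.20.1."""
    n = sum(counts)
    coeff = math.factorial(n)
    for f in counts:
        coeff //= math.factorial(f)   # exact integer division at every step
    p = float(coeff)
    for f, pi in zip(counts, probs):
        p *= pi ** f
    return p


if __name__ == "__main__":
    # One left turn, two straight ahead, three right turns out of six vehicles.
    print(multinomial_prob([1, 2, 3], [0.4, 0.5, 0.1]))
```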

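Let us now assume that each f_0i (i = 1, 2, ..., k) is large. Then,
by the use of Stirling's asymptotic approximation to the factorials
in III.20.1., it is found that

P \doteq \pi_1^{f_{01}}\, \pi_2^{f_{02}} \cdots \pi_k^{f_{0k}}\, \frac{n^{n+1/2}\, e^{-n} \sqrt{2\pi}}{f_{01}^{f_{01}+1/2}\, e^{-f_{01}} \sqrt{2\pi}\, \cdots\, f_{0k}^{f_{0k}+1/2}\, e^{-f_{0k}} \sqrt{2\pi}}    III.20.2.

where the symbol \doteq means "approximately equal to".

Since \sum_{i=1}^{k} f_{0i} = n, it is not hard to show that

P \doteq K \left(\frac{n\pi_1}{f_{01}}\right)^{f_{01}+1/2} \left(\frac{n\pi_2}{f_{02}}\right)^{f_{02}+1/2} \cdots \left(\frac{n\pi_k}{f_{0k}}\right)^{f_{0k}+1/2}    III.20.3.

where K is a factor that does not depend on the f_0i.

Now, let f_ti = n\pi_i (i = 1, 2, ..., k) and

X_i = \frac{f_{0i} - n\pi_i}{\sqrt{n\pi_i}} = \frac{f_{0i} - f_{ti}}{\sqrt{f_{ti}}}    III.20.4.

for i = 1, 2, ..., k.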

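Substituting from III.20.4. in III.20.3., and transforming to
logarithms, it is found that

\log P - \log K = \sum_{i=1}^{k} \left\{ \left(f_{0i} + \tfrac{1}{2}\right) \log \frac{f_{ti}}{f_{ti} + X_i \sqrt{f_{ti}}} \right\}
    = -\sum_{i=1}^{k} \left(f_{ti} + \tfrac{1}{2} + X_i \sqrt{f_{ti}}\right) \log\left(1 + \frac{X_i}{\sqrt{f_{ti}}}\right)    III.20.5.

It is next assumed that f_ti and f_0i for each i are of the same order
of magnitude. It then follows that X_i will be small compared with
f_ti. Expanding the logarithm in III.20.5. into a series, we have, to
first order,

\log P - \log K \doteq -\sum_{i=1}^{k} \left(f_{ti} + \tfrac{1}{2} + X_i \sqrt{f_{ti}}\right) \left(\frac{X_i}{\sqrt{f_{ti}}} - \frac{X_i^{2}}{2 f_{ti}}\right)
    \doteq -\sum_{i=1}^{k} \left\{ \tfrac{1}{2} X_i^{2} + X_i \sqrt{f_{ti}} \right\}    III.20.6.

But \sum_{i=1}^{k} (X_i \sqrt{f_{ti}}) = \sum_{i=1}^{k} (f_{0i} - f_{ti}) = n - n = 0.

Hence

\log P - \log K \doteq -\tfrac{1}{2} \sum_{i=1}^{k} X_i^{2}  and

P \doteq K\, e^{-\frac{1}{2} \sum_{i=1}^{k} X_i^{2}}    III.20.7.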

From III.20.7., it is clear that P depends on the sum of the squares
of k normal variates of unit variance which are subject to
the single constraint that \sum_{i=1}^{k} (X_i \sqrt{f_{ti}}) = 0.

This is precisely \chi^{2} (Chi-square) as will be seen in Chapter IV.

Hence,

\chi^{2} = \sum_{i=1}^{k} X_i^{2} = \sum_{i=1}^{k} \frac{(f_{0i} - f_{ti})^{2}}{f_{ti}}    III.20.8.

and is distributed as the sum of the squares of (k - 1)
independent normal variates each of unit variance.

The criterion given in III.20.8. is known as the Chi-square test of
goodness of fit and is useful in testing the hypothesis that a sample
at hand came from a universe of specified type.
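
The statistic in III.20.8. is a simple sum over the cells. The sketch below is an added illustration with a hypothetical function name. The observed counts of 45, 45, and 10 vehicles are invented for the purpose, and the hypothesized probabilities are the turning probabilities 0.4, 0.5, and 0.1 from the multinomial illustration.

```python
def chi_square(observed, probs):
    """Sum of (f_0i - f_ti)**2 / f_ti with theoretical frequencies f_ti = n * pi_i."""
    n = sum(observed)
    return sum((f0 - n * pi) ** 2 / (n * pi)
               for f0, pi in zip(observed, probs))


if __name__ == "__main__":
    observed = [45, 45, 10]    # hypothetical counts: left, straight, right
    probs = [0.4, 0.5, 0.1]    # hypothesized cell probabilities
    print(chi_square(observed, probs))
```

The resulting value would be referred to a Chi-square table with k - 1 = 2 degrees of freedom.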

The algebraic form of the distribution of \chi^{2} is

P(\chi^{2}) = \frac{1}{2^{(k-1)/2}\, \Gamma\left(\frac{k-1}{2}\right)}\, e^{-\chi^{2}/2}\, (\chi^{2})^{(k-3)/2}    III.20.9.

Using the table on page 220 for this function an application is
shown in Chapter V, page 163.

Thus  far  the  underlying  probability  of  success  has  been  assumed 
constant.  Suppose  now  that  the  probability  of  success  is  not  con- 
stant, but  depends  on  what  has  previously  happened  such  as  the 
case  of  finding  r  white  balls  from  an  urn  that  contains  np  white 
balls  and  nq  black  balls  when  s  balls  are  drawn  one  at  a  time  from 
the  urn  without  replacements. 

The  solution  of  such  a  situation  is  given  by  the  Hypergeometric 
Distribution. 

III. 21. Hypergeometric Distribution: Consider an urn in which there
are np white balls and nq black balls. Draw s balls one at a time
without replacements. The probability, P_r, that r (r = 0, 1, 2, ..., s)
of the s balls are white is

P_r = \frac{\dfrac{(np)!}{r!\,(np - r)!}\cdot\dfrac{(nq)!}{(s - r)!\,(nq - s + r)!}}{\dfrac{n!}{s!\,(n - s)!}} = \frac{(np)!\,(nq)!\,s!\,(n - s)!}{(np - r)!\,(nq - s + r)!\,n!\,r!\,(s - r)!}    III.21.1.

To illustrate: Consider the case of 100 vehicles approaching an
intersection of which np = 30 are trucks and nq = 70 are not
trucks. Consider any s = 5 of these vehicles one at a time. The
probability, P_r = P_3, that 3 of the 5 vehicles are trucks is

P_3 = \frac{30!\,70!\,5!\,95!}{27!\,68!\,100!\,3!\,2!} = 0.130

which means that in about 130 out of 1000 sets of 5 vehicles,
3 vehicles out of 5 will be trucks.
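
The factorial expression III.21.1. is the ratio of products of binomial coefficients, so it can be evaluated exactly with integer arithmetic. The sketch below is an added illustration with a hypothetical function name, applied to the vehicle example (30 trucks, 70 other vehicles, 5 drawn).

```python
import math


def hypergeom_prob(n_white: int, n_black: int, s: int, r: int) -> float:
    """P_r of III.21.1.: r white among s drawn without replacement."""
    return (math.comb(n_white, r) * math.comb(n_black, s - r)
            / math.comb(n_white + n_black, s))


if __name__ == "__main__":
    for r in range(6):
        print(r, hypergeom_prob(30, 70, 5, r))
```

Printing every value of r from 0 to 5 also shows that the probabilities sum to unity, as they must.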
Now, writing y_r for P_r, let

x = r + \tfrac{1}{2}  and  y = (y_r + y_{r+1})/2  and  \left(\frac{dy}{dx}\right)_{(x,\,y)} = y_{r+1} - y_r.

Then,

\left(\frac{dy}{dx}\right)_{(x,\,y)} = y_r\, \frac{s + nps - nq - 1 - r(n + 2)}{(r + 1)(r + 1 + nq - s)}    III.21.2.

From y = (y_r + y_{r+1})/2, it is found that

y = \tfrac{1}{2}\, y_r\, \frac{nps + nq + 1 - s + r(nq + 2 - np - 2s) + 2r^{2}}{(r + 1)(r + 1 + nq - s)}    III.21.3.

Dividing III.21.2. by III.21.3. and replacing r by x - \tfrac{1}{2}, we obtain

\left(\frac{1}{y}\right)\left(\frac{dy}{dx}\right) = \frac{2s + 2nps - 2nq - 2 - (2x - 1)(n + 2)}{nps + nq + 1 - s + (x - \tfrac{1}{2})(nq + 2 - np - 2s) + 2(x - \tfrac{1}{2})^{2}}    III.21.4.

The equation given in III.21.4. is the equation of the system of
curves which are continuous approximations to the law of
probability given in III.21.1.

The curves are usually known as the Pearson system of frequency
curves, which are the particular solutions of the differential
equation III.21.4.

The equation III.21.4. may be written in the form

\frac{dy}{dx} = \frac{y\,(x + a)}{b_0 + b_1 x + b_2 x^{2}}    III.21.5.

which has 12 particular solutions or 12 specific types of curves
dependent upon the values of the constants.11

The moments about the arithmetic mean of the distribution
III.21.1. are

\mu_2 = \frac{spq\,(n - s)}{n - 1}

\mu_3 = \frac{spq\,(q - p)(n - s)(n - 2s)}{(n - 1)(n - 2)}    III.21.6.

\mu_4 = \frac{spq\,(n - s)}{(n - 1)(n - 2)(n - 3)} \left[ n(n + 1) - 6s(n - s) + 3pq\{n^{2}(s - 2) - ns^{2} + 6s(n - s)\} \right]

n\,\mu_{r+1} = \{(1 + E)^{r} - E^{r}\} \left[ \mu_2 - \{np + s(q - p)\}\,\mu_1 + \{spq\,(n - s)\}\,\mu_0 \right]    III.21.7.

where E is an operator and means that

E\,\mu_r = \mu_{r+1}  (r = 0, 1, 2, ....).
The  maximum  term  of  III. 2 1.1.  is  approximately5 


y 


III.21.8. 


2  pqs  (n  —  s) 


If in III.21.6. and III.21.7., n \to \infty, the respective moments
become the moments of the binomial distribution, which shows that the
binomial or Bernoulli distribution is the limiting case (or the case of
a large or infinite universe) of the hypergeometric distribution (or
the case of a finite universe).

III.  22.  Correlation6 :  The  theory  of  correlation  is  devoted  to  the  en- 
deavor of  finding  laws  of  relationship  (dependence)  between  two 
or  more  variables.  Suppose  a  group  of  individuals  is  measured  in 
regard  to  a  certain  attribute.  It  is  found  that  the  individuals  differ 
in  their  measurements.  It  is  desired  to  explain  these  differences  in 
terms  of  factors  on  which  this  attribute  is  dependent  and  to  obtain 
laws  connecting  the  attribute  with  one  or  more  such  factors.  The 
better  the  law  of  connection  explains  the  variability  in  the  attribute 
in  question,  the  higher  is  the  correlation. 

To illustrate: One may wish to know whether the height of an
individual can be explained or measured by the weight of an
individual. In other words, are tall people heavy and short people
not heavy? It is well known that weight alone does not measure
height or explain the difference in the height of individuals.
In this instance there are more factors than the one factor weight.
There  are  three  main  types  of  correlation:  simple  correlation, 
multiple  correlation,  and  partial  correlation.  These  will  now  be  devel- 
oped and  discussed  in  the  order  named. 

The Correlation Coefficient r - Linear Regression or Linear Trend.
The regression or trend line is necessarily the best fitting line in the
sense of least squares. The line may be curved or straight. To start
with, let it be assumed that the regression (trend) line is a straight
line. The equation of this line is

$$y = mx + b \tag{III.22.1}$$

The values of m and b must be determined; they are, respectively,
the slope and y-intercept of the line. The x and y values are
observed in pairs, and (x, y) denotes the coordinates of any point on
the line. The formula III.22.1. describes an infinite number of lines,
each with its own m and b. No two different lines have both the same
m and the same b. If the lines are parallel, they have the same
m but different b's. If the lines pass through the same point on the
y-axis, they have the same b but different m's. We assume that
any one of the possible lines has the same weight as any other one
in arriving at a particular line, namely, the line that fits the data
best in the sense of Least Squares. The Principle of Least Squares,
used to determine the line of best fit, states that the line of best fit
for a series of values is the line for which the sum of the squares of
the vertical distances of the points from it is a minimum. There can
be only one line having this qualification. Another such
line exists for the horizontal distances. However, the one for
vertical distances is sufficient for most practical purposes.

In Figure III.4., suppose that the line RR' is the straight line
of best fit for the plotted points (scatter diagram) shown, and that
its equation is

$$y = mx + b \tag{III.22.1}$$

The y-distance, namely, y', of any point $(x_i, y_i)$ from this line
is equal to

$$y_i - (m x_i + b) \tag{III.22.2}$$

Figure III.4. Illustration of the Principle of Least Squares


The sum of these distances squared must be a minimum.
Symbolically,

$$d = \sum_{i=1}^{n} (m x_i + b - y_i)^2 \tag{III.22.3}$$

is to be a minimum. This necessitates that

$$\frac{\partial d}{\partial b} = 2 \sum_{i=1}^{n} (m x_i + b - y_i) = 0 \tag{III.22.4}$$

and

$$\frac{\partial d}{\partial m} = 2 \sum_{i=1}^{n} x_i (m x_i + b - y_i) = 0 \tag{III.22.5}$$

From III.22.4.:

$$\sum_{i=1}^{n} y_i = nb + m \sum_{i=1}^{n} x_i \tag{III.22.6}$$

where n equals the number of cases or number of points. From
III.22.5.:

$$\sum_{i=1}^{n} x_i y_i = b \sum_{i=1}^{n} x_i + m \sum_{i=1}^{n} x_i^2 \tag{III.22.7}$$

Equations III.22.6 and III.22.7 are the so-called "normal"
equations for finding the least-square straight line. The two equations
can be solved simultaneously to find the unknowns m and b. These
two equations are all that are needed to determine the equation
of the line of best fit. This line gives the relationship between the
two variables x and y.

The  procedure  can  be  illustrated  by  an  example.  The  required 
calculations  can  be  done  quite  rapidly  with  tables  and  a  calculating 
machine. 

Example: Given the associated pairs of values for x and y:
x: 3, 5, 8, 12, 17, 23, 30
y: 1, 2, 6, 23, 40, 50, 60
Using these values in equations III.22.6 and III.22.7, it is found
that

182 = 7 b + 98 m
3967 = 98 b + 1960 m

Solving these equations for b and m, we find that m = 2.41 and
b = -7.79, whence

$$y = 2.41x - 7.79 \tag{III.22.8}$$

is the equation of the best fitting straight line. From III.22.6,
dividing through by n,

$$m\bar{x} + b - \bar{y} = 0 \tag{III.22.9}$$

The equation III.22.9. expresses the fact that the linear function
(straight line) passes through the point whose coordinates are
$(\bar{x}, \bar{y})$.
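The example can be checked numerically. The following is a minimal sketch, assuming only the seven data pairs given above; the function name `fit_line` is illustrative and not part of the text. It solves the two normal equations III.22.6 and III.22.7 directly and confirms that the fitted line passes through the point of means, as III.22.9 states.

```python
def fit_line(xs, ys):
    """Solve the normal equations III.22.6 and III.22.7 for m and b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations:
    #   sum_y  = n * b     + m * sum_x
    #   sum_xy = b * sum_x + m * sum_xx
    det = n * sum_xx - sum_x * sum_x
    m = (n * sum_xy - sum_x * sum_y) / det
    b = (sum_y * sum_xx - sum_x * sum_xy) / det
    return m, b


xs = [3, 5, 8, 12, 17, 23, 30]
ys = [1, 2, 6, 23, 40, 50, 60]

m, b = fit_line(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# III.22.9: the least-square line passes through (x_bar, y_bar).
residual_at_mean = m * x_bar + b - y_bar

print("slope m =", m)
print("intercept b =", b)
print("m*x_bar + b - y_bar =", residual_at_mean)
```

The sums formed inside `fit_line` are the same quantities that appear in the two numerical normal equations of the example (7, 98, 182, 1960 and 3967).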

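Now measure all the x's and y's from their respective means as
origin and replace every x by its deviation x' from $\bar{x}$, and y by its
deviation y' from $\bar{y}$. Then III.22.9. becomes, since b now is zero,

$$y' = m x' \tag{III.22.10}$$

and III.22.7 becomes

$$m \sum_i x_i'^2 - \sum_i x_i' y_i' = 0$$

from which

$$m = \frac{\sum_i x_i' y_i'}{\sum_i x_i'^2} = \frac{np}{n\sigma_x^2} = \frac{p}{\sigma_x^2} \tag{III.22.11}$$

where $np = \sum_i x_i' y_i'$ and $n\sigma_x^2 = \sum_i x_i'^2$. It follows that

$$y' = \frac{p}{\sigma_x^2}\,x'$$

whence

$$y_x - \bar{y} = \frac{p}{\sigma_x^2}(x - \bar{x}) \tag{III.22.12}$$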


It is important to note that $y_x$ is the computed value of y for a
given x from the equation of the least-square line. For the line to
be a regression (trend) line, it is necessary that $y_x$ be the arithmetic
mean (or close to being so) of the values of y associated with a
given value of x.

Similarly

$$x_y - \bar{x} = \frac{p}{\sigma_y^2}(y - \bar{y}) \tag{III.22.13}$$

The coefficient $p/\sigma_x^2$ gives the deviation in y from the mean $\bar{y}$
corresponding to unit deviation in x from the mean $\bar{x}$, for when
$x - \bar{x} = 1$, $y_x - \bar{y} = p/\sigma_x^2$. Likewise, $p/\sigma_y^2$ gives the deviation in
x from the mean $\bar{x}$ corresponding to unit deviation in y from the
mean $\bar{y}$.

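But, in general, $p/\sigma_y^2 \neq p/\sigma_x^2$. It is therefore necessary to
alter the unit of measure so that unit changes in x and y are of
the same magnitude. Then

$$\frac{y_x - \bar{y}}{\sigma_y} = \frac{p}{\sigma_x \sigma_y}\left(\frac{x - \bar{x}}{\sigma_x}\right) \tag{III.22.14}$$

and

$$\frac{x_y - \bar{x}}{\sigma_x} = \frac{p}{\sigma_x \sigma_y}\left(\frac{y - \bar{y}}{\sigma_y}\right)$$

Next, write

$$r = \frac{p}{\sigma_x \sigma_y}$$

the coefficient of correlation. Hence

$$y_x - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}(x - \bar{x}) \tag{III.22.16}$$

and

$$x_y - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}(y - \bar{y}) \tag{III.22.17}$$

which are the regression (trend) lines. The numbers $r\,\sigma_y/\sigma_x$ and
$r\,\sigma_x/\sigma_y$ are called the coefficients of regression or of the trend.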
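Consider

$$y_x - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}(x - \bar{x}) \quad \text{or} \quad y' = r\,\frac{\sigma_y}{\sigma_x}\,x'.$$

Then

$$\begin{aligned}
d &= \sum_i \left(y_i' - r\,\frac{\sigma_y}{\sigma_x}\,x_i'\right)^2 \\
  &= \sum_i y_i'^2 - 2r\,\frac{\sigma_y}{\sigma_x}\sum_i x_i' y_i' + r^2\,\frac{\sigma_y^2}{\sigma_x^2}\sum_i x_i'^2 \\
  &= n\sigma_y^2 - 2r\,\frac{\sigma_y}{\sigma_x}\,(n r \sigma_y \sigma_x) + r^2\,\frac{\sigma_y^2}{\sigma_x^2}\,(n\sigma_x^2) \\
  &= n\sigma_y^2(1 - r^2)
\end{aligned} \tag{III.22.18}$$

Since d, being a sum of squares, cannot be negative, we have
$n\sigma_y^2(1 - r^2) \geq 0$ and

$$-1 \leq r \leq 1 \tag{III.22.19}$$

and r = ±1 when

$$y_i' = \pm\frac{\sigma_y}{\sigma_x}\,x_i'.$$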

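Now

$$np = \sum_i x_i' y_i' \quad \text{and} \quad x_i' = x_i - \bar{x}; \; y_i' = y_i - \bar{y}.$$

Hence

$$np = \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i (x_i y_i) - n\bar{x}\bar{y}.$$

Hence

$$p = \frac{\sum_i x_i y_i}{n} - \bar{x}\bar{y}$$

But

$$r = \frac{p}{\sigma_x \sigma_y}$$

Hence

$$r = \frac{\dfrac{\sum_i x_i y_i}{n} - \bar{x}\bar{y}}{\sigma_x \sigma_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n\,\sigma_x \sigma_y} = \frac{\sum_i x_i' y_i'}{n\,\sigma_x \sigma_y} \tag{III.22.20}$$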


From this relation, it is fairly clear that r may be considered as
the cosine of the angle between two vectors in Euclidean n space.
Again, from this fact, it follows that -1 ≤ r ≤ 1. Also, r is the
arithmetic mean of the products of the deviations of the
corresponding values from the respective arithmetic means when
measured in standard deviation units; also, r is sometimes called the
product-moment coefficient.
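The two readings of r given above can be compared numerically. The sketch below is illustrative only: it reuses the seven data pairs of the earlier example, and the function names are assumptions, not the authors'. One function evaluates r as the product moment of III.22.20; the other evaluates the cosine of the angle between the two vectors of deviations.

```python
import math


def product_moment_r(xs, ys):
    """r as in III.22.20: mean product of deviations over sigma_x * sigma_y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    sigma_x = math.sqrt(sum(d * d for d in dx) / n)
    sigma_y = math.sqrt(sum(d * d for d in dy) / n)
    p = sum(a * b for a, b in zip(dx, dy)) / n
    return p / (sigma_x * sigma_y)


def cosine_of_deviations(xs, ys):
    """Cosine of the angle between the deviation vectors in n-space."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    dot = sum(a * b for a, b in zip(dx, dy))
    norm_x = math.sqrt(sum(a * a for a in dx))
    norm_y = math.sqrt(sum(b * b for b in dy))
    return dot / (norm_x * norm_y)


xs = [3, 5, 8, 12, 17, 23, 30]
ys = [1, 2, 6, 23, 40, 50, 60]

r = product_moment_r(xs, ys)
r_cos = cosine_of_deviations(xs, ys)
print("product-moment r =", r)
print("cosine of angle  =", r_cos)
```

Because a cosine cannot exceed 1 in absolute value, the bound on r follows at once from this second form.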

The formulas useful in finding the value of the coefficient of
correlation are as follows:

(1) If the variables are in original units with respect to their
natural origin, then

$$r = \frac{\dfrac{\sum_i X_i Y_i}{n} - \bar{X}\bar{Y}}{\sigma_x \sigma_y} \tag{III.22.21}$$

(2) If the variables are referred to a class mid-point as an origin
and in terms of the class interval as a unit, then

$$r = \frac{\dfrac{\sum_i x_i y_i}{n} - \bar{x}\bar{y}}{\sigma_x \sigma_y} \tag{III.22.22}$$

These formulas are readily obtained algebraically from III.22.20.

To interpret r, it is necessary to use $r^2$, which is called the
determining coefficient.

If r, say, equals 0.70, we find that $r^2 = 0.49$, which means that
49 per cent of the variability in the y-values is determined or
explained by the potential determining or measuring factor x and
the linear theory connecting y with x. In other words, the theory
used or tested is but 49 per cent efficient as an estimating,
forecasting, or predicting theory.
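The statement that $r^2$ is the share of variability explained by the linear theory can be illustrated with a short computation. This is a sketch under stated assumptions: it reuses the seven data pairs of the earlier least-squares example, and the name `explained_fraction` is hypothetical. It compares $r^2$ with one minus the ratio of the residual sum of squares about the fitted line to the total sum of squares about the mean.

```python
def explained_fraction(xs, ys):
    """Return (r squared, 1 - residual SS / total SS) for a least-square line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    m = sxy / sxx               # slope of the least-square line
    b = y_bar - m * x_bar       # the line passes through the means
    residual_ss = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5
    return r * r, 1.0 - residual_ss / syy


xs = [3, 5, 8, 12, 17, 23, 30]
ys = [1, 2, 6, 23, 40, 50, 60]

r_squared, explained = explained_fraction(xs, ys)
print("r squared          =", r_squared)
print("explained fraction =", explained)
```

The two quantities agree to rounding error, which is the numerical content of calling $r^2$ the determining coefficient.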

III.  23.  Basic  theory  of  correlation.  To  explain  the  Basic  Theory  of 
Correlation  let  us  suppose  that  we  have  given  n  pairs  of  values  for 
the  variables  x  and  y.  The  problem  is  to  determine  the  nature 
and  degree  of  the  dependence  between  the  x  values  and  their 
corresponding  y  values. 

To determine the amount of interdependence that exists between
the pairs of variables it is convenient to represent them by
points in a two dimensional Euclidean manifold (scatter diagram).
To facilitate a description of the dependence we partition the data
into classes. This is accomplished by selecting class intervals of size
dx. We recall that the set of y values associated with a given value
of x on an interval of size dx is called an x array of y's. If it is
desired to describe the behavior of the expected values of the y
values associated with the x values, it is necessary to find the equation
of the curve y = f(x) that passes through these points. This curve
is known as the estimate of the true regression curve. The limiting
curve that is approached as dx tends toward zero is the true
regression curve (trend) of y on x and is actually the locus of the
arithmetic mean of arrays of y values of the theoretical
distribution as dx tends toward zero. The description of the theoretical
law of behavior appertaining to the arrangement of y is the
solution of the problem of statistical dependence (regression or trend
analysis) of y on x.

To illustrate: Consider the related values of minimum spacing,
center to center in feet, and speed in miles per hour.

Table III.4. is a correlation table which shows numerically as
well as graphically the two-way distribution connecting minimum
spacing, center to center in feet, with speed in miles per hour as
found by actual observation. The first question to be answered is:
How dependent upon the speed of a vehicle is the minimum
spacing? The answer to this question is found in interpreting the
value of the determining coefficient, which is the square of the
correlation coefficient.

Substituting in III.22.22 the required values from Table III.4,
it is found that

$$r = \frac{\dfrac{\sum (xy)}{n} - \bar{x}\bar{y}}{\sigma_x \sigma_y}$$

becomes

$$\begin{aligned}
r &= \frac{\dfrac{47440}{1336} - \left(\dfrac{-3321}{1336}\right)\left(\dfrac{-9849}{1336}\right)}{\sqrt{\dfrac{58771}{1336} - \left(\dfrac{-3321}{1336}\right)^2}\;\sqrt{\dfrac{113049}{1336} - \left(\dfrac{-9849}{1336}\right)^2}} \\
  &= \frac{35.509 - (-2.486)(-7.372)}{\sqrt{43.990 - 6.180}\;\sqrt{84.618 - 54.346}} \\
  &= \frac{35.509 - 18.327}{(6.149)(5.502)} = \frac{17.182}{33.832} \\
  &= 0.5079 = 0.51
\end{aligned} \tag{III.23.1}$$

This result means that $(0.5079)^2 = 0.2580 = .26$ = 26 per cent
of the variability in minimum spacing is explained by or dependent
upon the speed of the vehicle and the assumed linear connection
between spacing and speed. In other words, it appears that speed is
an unimportant or minor factor for determining minimum spacing.
This means that either there are several other factors which
together would explain 74 per cent of the variability or that there
exists a possible single other factor or that the relationship is not
linear. Of these, it appears that the former is the most likely.
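The arithmetic leading to III.23.1 can be retraced from the five sums and the count quoted in the substitution above. The following sketch assumes only those six numbers, which are expressed in class-interval units about an arbitrary origin as formula III.22.22 requires; the variable names are illustrative.

```python
import math

# Sums quoted in the substitution for III.23.1 (class-interval units).
n = 1336
sum_xy = 47440
sum_x = -3321
sum_y = -9849
sum_x2 = 58771
sum_y2 = 113049

mean_x = sum_x / n
mean_y = sum_y / n

# Standard deviations about the means, still in class-interval units.
sd_x = math.sqrt(sum_x2 / n - mean_x ** 2)
sd_y = math.sqrt(sum_y2 / n - mean_y ** 2)

# Formula III.22.22.
r = (sum_xy / n - mean_x * mean_y) / (sd_x * sd_y)
determining_coefficient = r * r

print("r =", r)
print("r squared =", determining_coefficient)
```

Since r is unchanged by a shift of origin or a change of unit, no conversion back to feet or miles per hour is needed at this stage.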

A second question that needs to be answered is: What is the
equation of the linear law of relationship which is useful to predict
the expected minimum spacing when the speed is known?

To answer this, it is necessary to use the regression equation
III.22.16, namely:

$$y_x - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}(x - \bar{x})$$

Substituting the values indicated by the use of Table III.4. and
III.23.1, it is found that

$$y_x - 47.0 = 0.508\,\frac{22.008}{12.300}\,(x - 22.0) \tag{III.23.2}$$

whence

$$y_x = 0.909x + 27.0$$

The graph of this equation is shown in Figure III.3. To illustrate
the use of III.23.2, suppose it is desired to know the minimum
spacing in feet if the speed is, say, 30 miles per hour. To answer
this question, substitute 30.0 for x in equation III.23.2, whence
the minimum spacing $y_x$ is found to be 54.3 feet. This means that
the expected, or average, minimum spacing center to center is
54.3 feet when the speed is 30.0 miles per hour.

A very important question now to be answered is: How typical
or reliable is the expected minimum spacing of 54.3 feet? This
question will be answered in article III.25.
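The use of the regression equation III.23.2 for prediction can be sketched in a few lines. The numerical constants below are those substituted in III.23.2; the function name `expected_spacing` is an illustrative assumption.

```python
# Values substituted in III.23.2.
r = 0.508          # correlation coefficient from III.23.1
sigma_y = 22.008   # standard deviation of spacing, feet
sigma_x = 12.300   # standard deviation of speed, miles per hour
mean_y = 47.0      # mean spacing, feet
mean_x = 22.0      # mean speed, miles per hour

# Coefficient of regression of y on x, and the resulting intercept.
slope = r * sigma_y / sigma_x
intercept = mean_y - slope * mean_x


def expected_spacing(speed_mph):
    """Expected minimum spacing in feet from the regression line of y on x."""
    return slope * speed_mph + intercept


print("slope     =", slope)
print("intercept =", intercept)
print("expected spacing at 30 mph =", expected_spacing(30.0))
```

The figure returned is an average for vehicles at the stated speed; it says nothing yet about how widely individual spacings scatter about that average.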

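III. 24. Coefficient of Regression: Consider

$$f = \sum_i n_{x_i}\left(\bar{y}_{x_i} - m x_i - b\right)^2$$

where $n_{x_i}$ is the number of values in the array of y's at $x_i$ and
$\bar{y}_{x_i}$ is the mean of that array. For f to be minimum

$$\frac{\partial f}{\partial m} = 0 \quad \text{and} \quad \frac{\partial f}{\partial b} = 0. \tag{III.24.1}$$

From equations III.24.1., with x and y measured from their means,

$$m = \frac{\sum_i n_{x_i}\bar{y}_{x_i} x_i - \sum_i n_{x_i} x_i \sum_i n_{x_i}\bar{y}_{x_i}/n}{\sum_i n_{x_i} x_i^2 - \left(\sum_i n_{x_i} x_i\right)^2/n} = \frac{\sum_i (x_i y_i)/n}{\sigma_x^2} = \frac{r\,\sigma_x \sigma_y}{\sigma_x^2} = r\,\frac{\sigma_y}{\sigma_x}$$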

III. 25. Standard Deviation of Arrays:

Consider

$$\begin{aligned}
nS_y^2 &= \sum_{i=1}^{n}\left(y_i - r\,\frac{\sigma_y}{\sigma_x}\,x_i\right)^2 \\
       &= \sum_i y_i^2 - 2r\,\frac{\sigma_y}{\sigma_x}\sum_i (y_i x_i) + r^2\,\frac{\sigma_y^2}{\sigma_x^2}\sum_i x_i^2 \\
       &= n\sigma_y^2 - 2nr^2\sigma_y^2 + nr^2\sigma_y^2 \\
       &= n\sigma_y^2(1 - r^2)
\end{aligned}$$

Hence:

$$S_y^2 = \sigma_y^2(1 - r^2) \tag{III.25.1}$$

$S_y$ may be regarded as a sort of average value of the standard
deviations of the arrays of y's and is sometimes called the
root-mean-square error of estimate of y, or more briefly, the standard
error of estimate of y. The factor $(1 - r^2)^{1/2}$ is called the coefficient
of alienation or the measure of the failure to improve the estimate
of y from the knowledge of correlation.

If $S_y$ is regarded as a function of x, say S(x), the curve

$$y = \frac{S(x)}{\sigma_y}$$

is called the scedastic curve. Its ordinates measure the scatter in
the arrays of y's in comparison to the scatter of all the y's. If S(x)
is a constant, the regression system of y on x is called a
homoscedastic system. If S(x) is not a constant, the system is said to be
heteroscedastic. For a homoscedastic system with linear regression,
$S_y = \sigma_y(1 - r^2)^{1/2}$ is the standard deviation of each array of y's.
Similarly, for the dispersion of x on y, we have $S_x^2 = \sigma_x^2(1 - r^2)$.

Return now to the spacing-speed illustration given in article III.23,
where it was found that the expected spacing is 54.3 feet when the
speed is 30.0 miles per hour. To determine the dependability of the
value found for spacing, it is necessary to obtain its standard
error or its measure of variability. This is given by III.25.1,
namely: if $S_y^2$ is the variance of the expected values for spacing,
then

$$S_y^2 = \sigma_y^2(1 - r^2).$$

Substituting the values for $\sigma_y^2$ and $r^2$ found earlier in this chapter,
we find that

$$S_y^2 = 484.35\,(1 - .2580) = 359.39$$

whence $S_y = 19.0$.

This means that on the average, when the speed is 30.0 miles
per hour, the spacing differs from the expected spacing of 54.3
feet by 19.0 feet. In other words, the probable or expected spacing
lies between 54.3 - 19.0 = 35.3 feet and 54.3 + 19.0 = 73.3 feet
when the speed is 30.0 miles per hour. It is fairly obvious that the
ability to predict the spacing knowing the speed is very poor and
of very little practical value.
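The standard error of estimate and the resulting band about the expected spacing follow directly from III.25.1. The sketch below assumes only the figures quoted in the text; the variable names are illustrative.

```python
import math

variance_y = 484.35        # variance of spacing, square feet
r_squared = 0.2580         # determining coefficient from III.23.1
expected_spacing = 54.3    # feet, at 30.0 miles per hour

# III.25.1: variance of the arrays about the regression line.
variance_of_estimate = variance_y * (1.0 - r_squared)
standard_error = math.sqrt(variance_of_estimate)

# Coefficient of alienation, the factor (1 - r^2)^(1/2).
alienation = math.sqrt(1.0 - r_squared)

lower = expected_spacing - standard_error
upper = expected_spacing + standard_error

print("variance of estimate =", variance_of_estimate)
print("standard error       =", standard_error)
print("alienation           =", alienation)
print("band: from", lower, "to", upper)
```

A coefficient of alienation this close to 1 shows how little the knowledge of speed narrows the scatter of spacing about its expected value.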

III. 26. Correlation Ratio: Non-Linear Regression: From III.25. it
may be seen that

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2} \tag{III.26.1}$$

If $S_y = 0$, $r = \pm 1$ and all the dots on the scatter diagram fall
exactly on the line of regression $y' = r\,\frac{\sigma_y}{\sigma_x}\,x'$. If $S_y = \sigma_y$, r = 0 and
the regression line is of no aid in predicting y from an assigned x.
Now, let $S_y'^2$ be the mean square of the deviations from the means
of arrays. Then $S_y'^2 = S_y^2$ when the regression is linear and
$S_y'^2 \neq S_y^2$ when the regression is not linear. This fact suggests the
use of

$$\eta_{yx}^2 = 1 - \frac{S_y'^2}{\sigma_y^2} \tag{III.26.2}$$

where $\eta_{yx}$ is the correlation ratio of y on x and $S_y'^2$ is the mean
square of the deviations from the means of arrays whether these
means are near to or far from the proposed line of regression. For
linear regression of y on x, we have $\eta_{yx}^2 = r^2$. Similarly for x on y,
we have

$$\eta_{xy}^2 = 1 - \frac{S_x'^2}{\sigma_x^2} \tag{III.26.3}$$

To illustrate the finding of the value of the correlation ratio, which
actually is the true measure of correlation, the procedure is to find
$\eta_{yx}^2$ from equation III.26.2., where

$$\eta_{yx}^2 = 1 - \frac{(S_y')^2}{\sigma_y^2}$$

As was explained, $(S_y')^2$ is the mean square of the deviations from
the means of arrays, namely

$$(S_y')^2 = \frac{f_1 s_1^2 + f_2 s_2^2 + f_3 s_3^2 + \cdots + f_k s_k^2}{n} \tag{III.26.4}$$

where $f_i$ is the frequency of the ith vertical array - the array when
x has the value $x_i$ - $s_i^2$ is the variance of the ith array, and k is
the number of arrays. From III.26.4., it is clear that $f_i s_i^2$ is
actually the sum of the squares of the deviations of the values for
the ith array of y's from the arithmetic mean of the ith array of y's.

Making use of Table III.4., it is found that, beginning with the
first array of y's, namely, the array of y's when x = 0.95, then the
second array when x = 2.95, and so on,

$$\begin{aligned}
f_1 s_1^2 &= 2(40.5 - 23.1)^2 + 1(36.5 - 23.1)^2 + 4(28.5 - 23.1)^2 \\
          &\quad + 19(24.5 - 23.1)^2 + 23(20.5 - 23.1)^2 + 6(16.5 - 23.1)^2 \\
          &= 1355.9
\end{aligned}$$

$$\begin{aligned}
f_2 s_2^2 &= 1(44.5 - 27.0)^2 + 3(40.5 - 27.0)^2 + 4(36.5 - 27.0)^2 + 6(32.5 - 27.0)^2 \\
          &\quad + 22(28.5 - 27.0)^2 + 24(24.5 - 27.0)^2 + 13(20.5 - 27.0)^2 + 2(16.5 - 27.0)^2 \\
          &= 2364.7
\end{aligned}$$

Similarly, it is found that

     i    f_i s_i^2         i    f_i s_i^2
     3      4108.8         15     59855.0
     4      5272.5         16     33508.7
     5      5489.2         17     45523.0
     6      3891.0         18     49788.0
     7      8295.6         19     14902.0
     8      1069.8         20     19500.7
     9     22976.7         21      6950.7
    10     15353.5         22      2578.5
    11     18564.5         23      2068.6
    12     40986.3         24      7680.0
    13     50938.5         25        37.1
    14     29733.6         26       288.0
                           27         0

Substituting the values of the $f_i s_i^2$ just found in III.26.4, it is
found that

$$(S_y')^2 = \frac{453080.9}{1336} = 339.1$$

From Table III.4 and III.23.1 it was found that

$$\sigma_y^2 = 16\,[84.618 - 54.346] = 16\,(30.272) = 484.4$$

Substituting the values just found for $(S_y')^2$ and $\sigma_y^2$ in III.26.2.,
it is found that

$$\eta_{yx}^2 = 1 - \frac{339.1}{484.4} = 1 - 0.70 = 0.30$$
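The computation of the correlation ratio can be retraced step by step. The sketch below is illustrative: it takes the frequencies, class mid-points and array mean of the second array, and the twenty-seven array sums, exactly as quoted above, and the function name `within_array_ss` is an assumption made for the example.

```python
def within_array_ss(array, array_mean):
    """Sum of squared deviations of one array of y's from its own mean.

    `array` is a list of (class mid-point, frequency) pairs.
    """
    return sum(f * (y - array_mean) ** 2 for y, f in array)


# Second array (x = 2.95): mid-points and frequencies as quoted above.
second_array = [(44.5, 1), (40.5, 3), (36.5, 4), (32.5, 6),
                (28.5, 22), (24.5, 24), (20.5, 13), (16.5, 2)]
ss_second = within_array_ss(second_array, 27.0)

# The 27 values of f_i * s_i^2 quoted in the text, arrays 1 through 27.
array_sums = [
    1355.9, 2364.7, 4108.8, 5272.5, 5489.2, 3891.0, 8295.6, 1069.8,
    22976.7, 15353.5, 18564.5, 40986.3, 50938.5, 29733.6, 59855.0,
    33508.7, 45523.0, 49788.0, 14902.0, 19500.7, 6950.7, 2578.5,
    2068.6, 7680.0, 37.1, 288.0, 0.0,
]

n = 1336
variance_y = 484.4

total = sum(array_sums)
mean_square_within = total / n                      # III.26.4
eta_squared = 1.0 - mean_square_within / variance_y  # III.26.2

print("second array sum of squares =", ss_second)
print("total within-array sum      =", total)
print("mean square within arrays   =", mean_square_within)
print("eta squared                 =", eta_squared)
```

Working from array means rather than from a fitted line is what frees the correlation ratio from the assumption of linearity.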

Previously, in III.23.1, it was found that, on the hypothesis of
linear regression, the determining coefficient $r^2 = .26$. If the
regression is not linear, we have found that the determining ratio -
the real and proper measure of correlation - is 0.30. A legitimate
question is: Is the difference between the determining ratio and the
determining coefficient large enough to justify the rejection of the
hypothesis of linear regression? The technique to answer this
question will be shown in Chapter IV.

The reader is cautioned not to follow the usual practice of
tacitly assuming linear regression and in this sense finding the value
of $r^2$. The proper procedure is to find $\eta^2$ first. Then it should be
determined whether $\eta^2$ is large enough to justify the obtaining of
the actual regression (trend) function as well as whether $\eta^2$ is large
enough to indicate that a significant correlation exists. The former
is discussed and shown in III.29. and the latter in Chapter IV.

In the case just illustrated it is true that $\eta^2 = 0.30$ indicates
real correlation, but it is much too small for predicting or
estimation purposes. It is also true that there are sufficient grounds, as
will be seen in III.29., to reject the hypothesis of linear regression.

The mean square of the deviations in each array is a minimum
when the deviations are taken from the mean of the array. Hence,
the $(S_y')^2$ in III.26.2. must be equal to or less than $S_y^2$ in III.26.1.
for the same data, since the deviations in III.26.1. are measured
from the proposed line of regression. Hence, we have shown that

$$1 \geq \eta_{yx}^2 \geq r^2$$

It follows from III.26.2. that $\eta_{yx}^2 \leq 1$.

If regression of y on x is linear, $\eta_{yx}^2 - r^2$ found from the sample
differs from zero by an amount not greater than fluctuations due
to random sampling. A comparison of $\eta_{yx}^2 - r^2$ with its sampling
error is a useful criterion for testing linearity of regression. A better
and more powerful method, however, to test linearity of regression
is by the use of the Analysis of Variance.

III. 27. Multiple Correlation: Suppose we have given N sets of
corresponding values of n variables $x_1, x_2, \ldots, x_n$. Now separate the
values of $x_1$ into classes by selecting class intervals $dx_2, dx_3, \ldots,
dx_n$ of the remaining variables.

The locus of means of such arrays of $x_1$'s in the theoretical
distribution, as $dx_2, \ldots, dx_n$ approach zero, is called the regression
surface (trend) of $x_1$ on the remaining variables. We now assume,
for convenience, that any variable, $x_j$, is measured from its
arithmetic mean as origin. Let $\sigma_j$ be its standard deviation and let $r_{pq}$
be the correlation coefficient of the N given pairs of values of $x_p$
and $x_q$. We now seek to find $b_{12}, b_{13}, \ldots, b_{1n}$ of the linear
regression surface

$$x_1 = b_{12}x_2 + b_{13}x_3 + \cdots + b_{1n}x_n + c \tag{III.27.1}$$

of $x_1$ on the remaining variables so that $x_1$ computed from
III.27.1. will give the best estimates in the sense of Least Squares
of the values of $x_1$ that correspond to any assigned values of $x_2, \ldots,
x_n$. It follows that

$$U = \sum (x_1 - b_{12}x_2 - b_{13}x_3 - \cdots - b_{1n}x_n - c)^2 \tag{III.27.2}$$

shall be a minimum. This gives us for the linear regression surface

$$x_1 = -\sigma_1 \sum_{q=2}^{n} \frac{R_{1q}}{R_{11}}\,\frac{x_q}{\sigma_q} \tag{III.27.3}$$

where

$$R = \begin{vmatrix}
r_{11} & r_{12} & \cdots & r_{1n} \\
r_{21} & r_{22} & \cdots & r_{2n} \\
\vdots & \vdots &        & \vdots \\
r_{n1} & r_{n2} & \cdots & r_{nn}
\end{vmatrix}$$

and $R_{pq}$ is the cofactor of the pth row and qth column of R.

If the dispersion $\sigma_{1.23 \ldots n}$ of the observed values of $x_1$ from
computed values is defined as

$$\sigma_{1.23 \ldots n}^2 = \frac{1}{N} \sum (\text{observed } x_1 - \text{computed } x_1)^2 \tag{III.27.4}$$

then, it can be proved that

$$\sigma_{1.23 \ldots n}^2 = \sigma_1^2\,\frac{R}{R_{11}} \tag{III.27.5}$$

We are next interested in the dispersion of the estimated values
given by III.27.3. Since the mean value of the estimates is zero,
when the origin is at the mean of each system of variates, it can
be shown that

$$\sigma_{1E}^2 = \sigma_1^2\left(1 - \frac{R}{R_{11}}\right) \tag{III.27.6}$$

The square of the multiple correlation coefficient $r_{1.23 \ldots n}$ of
order (n - 1) of $x_1$ with the other n - 1 variables is given by

$$r_{1.23 \ldots n}^2 = 1 - \frac{R}{R_{11}} \tag{III.27.7}$$
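The determinant formulas III.27.3 and III.27.7 can be tried on a small case. The sketch below is illustrative only: the three-variable correlation matrix is hypothetical (it is not taken from any table in this chapter), and the function names are assumptions. For three variables the result can be compared with the familiar closed form $(r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}) / (1 - r_{23}^2)$.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in matrix]
    size = len(a)
    result = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda row: abs(a[row][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result          # a row swap changes the sign
        result *= a[col][col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
    return result


def cofactor(matrix, p, q):
    """Cofactor of row p, column q (both counted from zero)."""
    minor = [row[:q] + row[q + 1:]
             for i, row in enumerate(matrix) if i != p]
    return (-1) ** (p + q) * det(minor)


def multiple_r_squared(corr):
    """III.27.7: 1 - R / R_11 for the first variable on the others."""
    return 1.0 - det(corr) / cofactor(corr, 0, 0)


def standardized_coefficients(corr):
    """-R_1q / R_11 for q = 2..n, the coefficients of x_q / sigma_q
    in III.27.3 when x_1 is expressed in units of sigma_1."""
    r11 = cofactor(corr, 0, 0)
    return [-cofactor(corr, 0, q) / r11 for q in range(1, len(corr))]


# Hypothetical correlations among three variables.
corr = [[1.0, 0.5, 0.4],
        [0.5, 1.0, 0.3],
        [0.4, 0.3, 1.0]]

r2 = multiple_r_squared(corr)
betas = standardized_coefficients(corr)
closed_form = (0.5 ** 2 + 0.4 ** 2 - 2 * 0.5 * 0.4 * 0.3) / (1 - 0.3 ** 2)

print("multiple r squared =", r2)
print("closed form        =", closed_form)
print("standardized coefficients =", betas)
```

Multiplying each standardized coefficient by $\sigma_1/\sigma_q$ gives the corresponding b of III.27.1.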


The analysis of data furnished by J. S. Ellerby, Safety Director,
Fort Belvoir, Virginia will serve as an example of multiple
correlation. These data consist of the following information on 440
drivers:

x1 = Road Test

x2 = Years of Experience

x3 = Reaction Time

x4 = Distance Judgment

x5 = Driver Information (Written test)

Let us assume that the road test is a measure of driver ability
and let it be our problem to determine whether each of the other
tests individually or collectively measure driving ability.

The first step is to determine the simple correlation between
each of the tests. The procedure for this is that followed in the
example of finding the correlation between speed and minimum
spacing.

These correlations are shown in Table III.5. Before using these
results to obtain a multiple correlation let us consider the
significance of these simple correlations. It is noted immediately that
none of them is large enough to be significant and therefore our
conclusion is that none of the tests is of value as a measure of
driving ability.


Table III.5
Simple Correlation of Driver Tests
(the entry in row p and column q is the correlation coefficient r_pq)

                           (1)        (2)          (3)        (4)        (5)
                           Road       Years        Reaction   Distance   Driver
                           Test       Experience   Time       Judgment   Information

(1) Road Test             1.0000      .0476        .0257      .05514     .2608
(2) Years Experience       .0476     1.0000        .006157    .00191    -.4603
(3) Reaction Time          .0257      .006157     1.0000     -.0404     -.1027
(4) Distance Judgment      .05514     .00191      -.0404     1.0000      .1568
(5) Driver Information     .2608     -.4603       -.1027      .1568     1.0000

At least one of the correlations is opposite to what one might
expect. A driver with an increase in experience apparently knows
less about driving, since the correlation is negative (-.46).
However, since $r^2 = (.46)^2 = .21$ = 21 per cent, only this amount of the
variability in driving knowledge may be said to be explained by or
dependent upon experience; consequently it may be said that there
is little or no connection between driving ability and experience.

We would not of course be justified in concluding from this one
study that drivers' tests have no value, for it may be that all of the
drivers tested are good drivers and their visual acuity, reaction
time, and other capabilities are well within the safe range. For
example, the total range of reaction time was from .350 to .560
seconds. A driver with a reaction time much slower than .56 might
be an accident prone driver. It is fair to say that it is quite a bit
more likely than not, however, that these deductions are valid.

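The next question to be answered is that of whether the tests as
a whole give any indication of driving ability, i.e., whether the
sets of data $x_2$, $x_3$, $x_4$, and $x_5$ taken together furnish us with a
measure of driving ability. To answer this question, we make use
of the theory of multiple linear correlation. The first step in the
analysis is to find the multiple linear regression equation. This is
done by substituting the values for the r's from Table III.5 in
equation III.27.3. and solving by determinants.

$$x_1 = -\sigma_1\left[\frac{R_{12}}{R_{11}}\frac{x_2}{\sigma_2} + \frac{R_{13}}{R_{11}}\frac{x_3}{\sigma_3} + \frac{R_{14}}{R_{11}}\frac{x_4}{\sigma_4} + \frac{R_{15}}{R_{11}}\frac{x_5}{\sigma_5}\right]
= -\frac{\sigma_1}{\sigma_2}\frac{R_{12}}{R_{11}}\,x_2 - \frac{\sigma_1}{\sigma_3}\frac{R_{13}}{R_{11}}\,x_3 - \frac{\sigma_1}{\sigma_4}\frac{R_{14}}{R_{11}}\,x_4 - \frac{\sigma_1}{\sigma_5}\frac{R_{15}}{R_{11}}\,x_5$$

Written out in determinants, with the common denominator

$$R_{11} = \begin{vmatrix}
r_{22} & r_{23} & r_{24} & r_{25} \\
r_{32} & r_{33} & r_{34} & r_{35} \\
r_{42} & r_{43} & r_{44} & r_{45} \\
r_{52} & r_{53} & r_{54} & r_{55}
\end{vmatrix}$$

this is

$$\begin{aligned}
x_1 &= +\frac{\sigma_1}{\sigma_2}\,\frac{1}{R_{11}}
\begin{vmatrix}
r_{21} & r_{23} & r_{24} & r_{25} \\
r_{31} & r_{33} & r_{34} & r_{35} \\
r_{41} & r_{43} & r_{44} & r_{45} \\
r_{51} & r_{53} & r_{54} & r_{55}
\end{vmatrix} x_2
- \frac{\sigma_1}{\sigma_3}\,\frac{1}{R_{11}}
\begin{vmatrix}
r_{21} & r_{22} & r_{24} & r_{25} \\
r_{31} & r_{32} & r_{34} & r_{35} \\
r_{41} & r_{42} & r_{44} & r_{45} \\
r_{51} & r_{52} & r_{54} & r_{55}
\end{vmatrix} x_3 \\
&\quad + \frac{\sigma_1}{\sigma_4}\,\frac{1}{R_{11}}
\begin{vmatrix}
r_{21} & r_{22} & r_{23} & r_{25} \\
r_{31} & r_{32} & r_{33} & r_{35} \\
r_{41} & r_{42} & r_{43} & r_{45} \\
r_{51} & r_{52} & r_{53} & r_{55}
\end{vmatrix} x_4
- \frac{\sigma_1}{\sigma_5}\,\frac{1}{R_{11}}
\begin{vmatrix}
r_{21} & r_{22} & r_{23} & r_{24} \\
r_{31} & r_{32} & r_{33} & r_{34} \\
r_{41} & r_{42} & r_{43} & r_{44} \\
r_{51} & r_{52} & r_{53} & r_{54}
\end{vmatrix} x_5
\end{aligned}$$

$$= -9.3287\left[\frac{.0092}{.7532}\,\frac{x_2}{11.4434} + \frac{-.0460}{.7532}\,\frac{x_3}{.0452} + \frac{-.0030}{.7532}\,\frac{x_4}{10.2713} + \frac{-.2722}{.7532}\,\frac{x_5}{2.7367}\right]$$

$$= -.0016\,x_2 + .0253\,x_3 + .0036\,x_4 + 1.2318\,x_5.$$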

The next question that is to be answered is how reliable are the
expected values of the $x_1$'s as determined from the regression
equation when sets of values for $x_2$, $x_3$, $x_4$, and $x_5$ are known. The
square of the multiple correlation coefficient when properly
interpreted is the answer to this question.

This is equation III.27.7

$$r_{1.2345}^2 = 1 - \frac{R}{R_{11}}$$

We first find R by substituting the values from Table III.5 for
its determinant and solving.

$$R = \begin{vmatrix}
r_{11} & r_{12} & r_{13} & r_{14} & r_{15} \\
r_{21} & r_{22} & r_{23} & r_{24} & r_{25} \\
r_{31} & r_{32} & r_{33} & r_{34} & r_{35} \\
r_{41} & r_{42} & r_{43} & r_{44} & r_{45} \\
r_{51} & r_{52} & r_{53} & r_{54} & r_{55}
\end{vmatrix} = .6774$$

Therefore, since $R_{11} = .7532$ as determined above,

$$r_{1.2345}^2 = 1 - \frac{.6774}{.7532} = 1 - .8994 = .1006$$


Since this value, .1006, means that only 10.06 per cent of the variability in road tests is explained by the composite knowledge of the factors (years of experience, reaction time, distance judgment, and driver information), it may be concluded that the composite result of these tests is practically worthless as a measure of driving ability as shown by the road test.

Another question to be answered is what is the standard error in the expected values of $x_1$. This standard error is a measure of the total variability that is not explained, or in other words, is not dependent upon the sets of values of x2, x3, x4, and x5.

The standard error in the expected value of $x_1$ obtained from the regression equation III.27.5 is equal to

$$\sigma_{1.2345} = \sigma_1\sqrt{\frac{R}{R_{11}}}$$

$$\sigma_{1.2345} = 9.3287\sqrt{\frac{.6774}{.7532}} = 8.847 \text{ per cent}$$

Since

$$\sigma_1^2 = \sigma_1^2\left(\frac{R}{R_{11}}\right) + \sigma_1^2\left(1 - \frac{R}{R_{11}}\right)$$

we may say that the proportional part of the total variability ($\sigma_1^2$) that is not explained in terms of x2, x3, x4, and x5 is $R/R_{11} = .8994 = 89.94$ per cent and that the explained variability is

$$1 - \frac{R}{R_{11}} = 1 - .8994 = .1006 = 10.06 \text{ per cent.}$$

As a check:

$$\frac{R}{R_{11}} + \left(1 - \frac{R}{R_{11}}\right) = .8994 + .1006 = 1.$$
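The arithmetic of this step can be retraced with a few lines of Python. This is only a sketch: the three inputs are the values quoted in the text (the entries of Table III.5 are not reproduced here), and the variable names are chosen for this illustration.

```python
import math

# Values quoted in the text; Table III.5 itself is not reproduced here.
R = 0.6774        # determinant of the full correlation matrix
R11 = 0.7532      # cofactor of r11
sigma_1 = 9.3287  # standard deviation of the road-test scores

unexplained = R / R11                          # share of variability not explained
explained = 1.0 - unexplained                  # squared multiple correlation
std_error = sigma_1 * math.sqrt(unexplained)   # standard error of the estimate of x1

print(f"unexplained share: {unexplained:.4f}")
print(f"explained share:   {explained:.4f}")
print(f"standard error:    {std_error:.3f}")
```

The two shares always add to one, which is the check made at the end of the section.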


III. 28. Partial Correlation: Very often we wish to know the degree of correlation between two variables $x_1$ and $x_2$ when the other variables $x_3, x_4, \ldots, x_n$ have assigned values. Thus, we define a partial correlation coefficient $r_{12.34 \ldots n}$ of $x_1$ and $x_2$ for assigned $x_3, x_4, \ldots, x_n$ as the correlation coefficient of $x_1$ and $x_2$ in the part of the population for which $x_3, x_4, \ldots, x_n$ have assigned values. A change in the assigned values may lead to the same or different values of $r_{12.34 \ldots n}$.

Assume that the theoretical mean or expected values of $x_1$ and $x_2$ for an assigned $x_3, x_4, \ldots, x_n$ are

$$\left.\begin{aligned} \bar{x}_1 &= b_{13}x_3 + b_{14}x_4 + \ldots + b_{1n}x_n \\ \bar{x}_2 &= b_{23}x_3 + b_{24}x_4 + \ldots + b_{2n}x_n \end{aligned}\right\} \tag{III.28.1}$$

respectively.

Then, a partial correlation coefficient $r'_{12.34 \ldots n}$ is the simple correlation coefficient of residuals

$$\left.\begin{aligned} x_{1.34 \ldots n} &= x_1 - b_{13}x_3 - b_{14}x_4 - \ldots - b_{1n}x_n \\ x_{2.34 \ldots n} &= x_2 - b_{23}x_3 - b_{24}x_4 - \ldots - b_{2n}x_n \end{aligned}\right\} \tag{III.28.2}$$

limited to the part of the population $n_{34 \ldots n}$ of the total $n$ for which $x_3, x_4, \ldots, x_n$ are fixed.

Suppose further that the population is such that any change in the assignment of values to $x_3, x_4, \ldots, x_n$ does not change the standard deviation of $x_{1.34 \ldots n}$ nor of $x_{2.34 \ldots n}$ nor the value of $r_{12.34 \ldots n}$. Such a population suggests that we define

$$r_{12.34 \ldots n} = \frac{\Sigma\, x_{1.34 \ldots n}\, x_{2.34 \ldots n}}{n\,\sigma_{1.34 \ldots n}\,\sigma_{2.34 \ldots n}} \tag{III.28.3}$$

where the summation extends to $n$ pairs of residuals, as the partial correlation coefficient of $x_1$ and $x_2$ for all sets of assignments of $x_3, \ldots, x_n$.

If the population is such that $r'_{12.34 \ldots n}$ is not the same for each different set of assignments of $x_3, x_4, \ldots, x_n$, the right hand member of III.28.3 may still be regarded as a sort of average value of correlation coefficients of $x_1$ and $x_2$ in subdivisions of a population obtained by assigning $x_3, x_4, \ldots, x_n$, or it may be regarded as the correlation coefficient between the deviations of $x_1$ and $x_2$ from the corresponding predicted values given by their linear equations on $x_3, x_4, \ldots, x_n$. It can be shown that

$$r_{12.34 \ldots n} = \frac{-R_{12}}{(R_{11}R_{22})^{1/2}} \tag{III.28.4}$$

To illustrate, we make use of the data for the Driver tests previously given in Table III.5 and set ourselves the problem of finding the correlation between road test and years of experience under the assumption that each is influenced to some extent by reaction time, distance judgment and driver information. If each is thus influenced, the simple correlation coefficient between the road test and driver experience is a spurious correlation. Partial correlation between road test and years of experience is the theory of correlation that removes the influence of reaction time, distance judgment, and driver information. Substituting the values of the R's in III.28.4, we find that

$$r_{12.345} = \frac{-R_{12}}{(R_{11}R_{22})^{1/2}}$$

wherein $R_{12}$ and $R_{11}$ have the values already determined and $R_{22}$ has the value .8960 found by substituting values from Table III.5 and solving the determinant

$$R_{22} = \begin{vmatrix} r_{11} & r_{13} & r_{14} & r_{15} \\ r_{31} & r_{33} & r_{34} & r_{35} \\ r_{41} & r_{43} & r_{44} & r_{45} \\ r_{51} & r_{53} & r_{54} & r_{55} \end{vmatrix} = .8960$$

hence

$$r_{12.345} = \frac{-R_{12}}{(R_{11}R_{22})^{1/2}} = \frac{-.0092}{\sqrt{(.7532)(.8960)}} = \frac{-.0092}{\sqrt{.6749}} = \frac{-.0092}{.8215} = -0.011$$

therefore, there is practically no partial correlation.
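As an illustration of III.28.4, the sketch below computes a partial correlation coefficient from the cofactors of a correlation matrix. The helper names (`det`, `cofactor`, `partial_corr`) and the three-variable matrix are assumptions made for this example; that matrix is hypothetical and serves only to check the cofactor formula against the familiar three-variable expression. The last line repeats the driver-test computation from the cofactor values quoted in the text.

```python
import math

def det(m):
    """Determinant by expansion along the first row (adequate for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * det(minor)
    return total

def cofactor(m, i, j):
    """Signed minor of element (i, j), with 0-based indices."""
    minor = [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]
    return (-1) ** (i + j) * det(minor)

def partial_corr(r, i, j):
    """Partial correlation of variables i and j, all others held fixed (III.28.4)."""
    return -cofactor(r, i, j) / math.sqrt(cofactor(r, i, i) * cofactor(r, j, j))

# Hypothetical correlation matrix, used only to check the formula.
r = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
by_cofactors = partial_corr(r, 0, 1)
by_three_variable_formula = (0.5 - 0.3 * 0.4) / math.sqrt((1 - 0.3 ** 2) * (1 - 0.4 ** 2))

# Driver-test example, from the cofactor values quoted in the text.
r_12_345 = -0.0092 / math.sqrt(0.7532 * 0.8960)
print(by_cofactors, by_three_variable_formula, r_12_345)
```

Expansion along a row is slow for large matrices, but for the five-variable case treated here it keeps the code close to the hand method.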


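III. 29. Regression (Trend) Lines: Let

$$Y = a_0 + a_1 X + a_2 X^2 + \ldots + a_p X^p \tag{III.29.1}$$

be the equation of expected values of Y that are associated with the various values of X. It is desired to know the values of the a's such that the value of U given by

$$U = \sum_i \left(y_i - a_0 - a_1 x_i - a_2 x_i^2 - \ldots - a_p x_i^p\right)^2 \tag{III.29.2}$$

is a minimum.

This requires that, for $j = 0, 1, \ldots, p$,

$$\frac{\partial U}{\partial a_j} = -2\left[\sum_i (x_i^j y_i) - a_0 \sum_i x_i^j - a_1 \sum_i x_i^{j+1} - \ldots - a_p \sum_i x_i^{j+p}\right] = 0 \tag{III.29.3}$$

whence

$$a_j = \frac{\Delta_j^{(p)}}{\Delta^{(p)}} \tag{III.29.4}$$

where

$$\Delta^{(p)} = \begin{vmatrix} \mu_0 & \mu_1 & \ldots & \mu_p \\ \mu_1 & \mu_2 & \ldots & \mu_{p+1} \\ \vdots & \vdots & & \vdots \\ \mu_p & \mu_{p+1} & \ldots & \mu_{2p} \end{vmatrix} \tag{III.29.5}$$

the moments $\mu_0, \mu_1, \ldots, \mu_{2p}$ being computed from the sums $n, \Sigma f_x x, \ldots, \Sigma f_x x^{2p}$ and the product moments $\mu_{01}, \mu_{11}, \ldots, \mu_{p1}$ from the sums $\Sigma f_x y_x, \Sigma f_x x y_x, \ldots, \Sigma f_x x^p y_x$, and $\Delta_j^{(p)}$ is the determinant obtained by substituting the product moments $\mu_{01}, \ldots, \mu_{p1}$ for the $(j+1)$th column in $\Delta^{(p)}$.

It is not too difficult to show that the regression (trend) equation may be written in the form

$$\begin{vmatrix} Y & 1 & X & \ldots & X^p \\ \mu_{01} & \mu_0 & \mu_1 & \ldots & \mu_p \\ \mu_{11} & \mu_1 & \mu_2 & \ldots & \mu_{p+1} \\ \vdots & \vdots & \vdots & & \vdots \\ \mu_{p1} & \mu_p & \mu_{p+1} & \ldots & \mu_{2p} \end{vmatrix} = 0 \tag{III.29.6}$$
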

Now consider

$$Y = b_0 P_0 + b_1 P_1 + \ldots + b_p P_p$$

and demand that $\Sigma (P_j P_k) = 0$ when $j \neq k$, where the P's are polynomials in X, $P_j$ being of degree $j$.

Again, minimizing

$$\sum_{\substack{x = x_1 \\ y = y_1}}^{\substack{x = x_n \\ y = y_n}} \left(y - b_0 P_0 - b_1 P_1 - \ldots - b_p P_p\right)^2 \tag{III.29.7}$$

it is found that

$$\Sigma (y P_j) - b_0 \Sigma (P_0 P_j) - \ldots - b_p \Sigma (P_p P_j) = 0 \tag{III.29.8}$$

Since $\Sigma (P_j P_k)$ for $j \neq k$ is zero, III.29.8 reduces to

$$\Sigma (y P_j) - b_j \Sigma (P_j^2) = 0. \tag{III.29.9}$$

Hence $b_j$ is simply determined by $P_j$, and if, in fitting a curve of degree $p$, it is desired to proceed a step farther and add a term $b_{p+1} P_{p+1}$, the coefficients $b_0, \ldots, b_p$ already found remain unaltered. This method is known as the method of orthogonal polynomials.

The  use  of  orthogonal  polynomials  gives  a  convenient  method  of 
determining  step  by  step  the  goodness  of  fit  of  the  regression  line. 
Consider

$$U = \Sigma \left(y - b_0 P_0 - \ldots - b_p P_p\right)^2$$

$$= \Sigma (y^2) - 2 b_0 \Sigma (y P_0) - \ldots - 2 b_p \Sigma (y P_p) + b_0^2 \Sigma (P_0^2) + \ldots + b_p^2 \Sigma (P_p^2)$$

But, from III.29.9, we may express $\Sigma (y P_j)$ in terms of $\Sigma (P_j^2)$. Hence

$$U = \Sigma (y^2) - b_0^2 \Sigma (P_0^2) - \ldots - b_p^2 \Sigma (P_p^2) \tag{III.29.10}$$

This shows that the effect of any term $b_j P_j$ is to reduce U by $b_j^2 \Sigma (P_j^2)$ and the effect of this term on U is an independent matter. Again, if it is found that the addition of any term $b_j P_j$ does not reduce U significantly, the conclusion is that the term is redundant and therefore not necessary or that the fit is good enough.
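The two properties just stated (earlier coefficients stay fixed when a term is added, and each term lowers U by $b_j^2 \Sigma (P_j^2)$) can be checked numerically. The sketch below is an illustration under stated assumptions: the data are made up, and the orthogonal polynomials are built by Gram-Schmidt orthogonalization of 1, x, x², ... at the data points rather than by the determinant formulas developed next. Both routes give polynomials with leading coefficient one.

```python
# Small made-up data set (hypothetical, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.4]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonal_columns(xs, degree):
    """Values of P_0 .. P_degree at each x, by Gram-Schmidt on 1, x, x^2, ..."""
    cols = []
    for k in range(degree + 1):
        col = [x ** k for x in xs]
        for prev in cols:
            c = dot(col, prev) / dot(prev, prev)
            col = [a - c * b for a, b in zip(col, prev)]
        cols.append(col)
    return cols

def fit(xs, ys, degree):
    cols = orthogonal_columns(xs, degree)
    b = [dot(ys, p) / dot(p, p) for p in cols]                       # III.29.9
    u_formula = dot(ys, ys) - sum(bj ** 2 * dot(p, p)
                                  for bj, p in zip(b, cols))         # III.29.10
    fitted = [sum(bj * p[i] for bj, p in zip(b, cols)) for i in range(len(xs))]
    u_direct = sum((y - f) ** 2 for y, f in zip(ys, fitted))         # III.29.7
    return b, u_formula, u_direct

b_lin, u_lin, u_lin_direct = fit(xs, ys, 1)
b_quad, u_quad, u_quad_direct = fit(xs, ys, 2)
print(b_lin, b_quad[:2])   # the first two coefficients agree
print(u_lin, u_quad)       # U does not increase when a term is added
```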

It is now necessary to obtain the expressions for the various orthogonal polynomials. To this end, let

$$P_p = \sum_j C_{pj} X^j \tag{III.29.11}$$

In III.29.11, there are $(p + 1)$ unknown constants. Hence, in all the polynomials up to and including those of order $p$, there are $\tfrac{1}{2}(p + 1)(p + 2)$ constants. The orthogonal relations up to and including order $p$ provide $\tfrac{1}{2}\,p\,(p + 1)$ conditions on the C's. It follows that $\tfrac{1}{2}(p + 1)(p + 2) - \tfrac{1}{2}\,p\,(p + 1) = p + 1$ constants are assignable at will. For convenience, take one constant for each P and assign it so that the coefficient of $X^j$ in $P_j$ has the value unity. In other words, put

$$C_{jj} = 1 \tag{III.29.12}$$

Rewriting III.29.11, we get

$$\begin{aligned} P_1 &= C_{10} + X \\ P_2 &= C_{20} + C_{21} X + X^2 \\ P_3 &= C_{30} + C_{31} X + C_{32} X^2 + X^3 \\ &\;\;\vdots \\ P_p &= C_{p0} + C_{p1} X + C_{p2} X^2 + \ldots + X^p \end{aligned} \tag{III.29.13}$$

From the orthogonal relations

$$\begin{aligned} \Sigma P_p P_0 &= \Sigma P_p = 0 \\ \Sigma P_p P_1 &= 0 \\ &\;\;\vdots \end{aligned} \tag{III.29.14}$$

This system, III.29.14, is equivalent to

$$\begin{aligned} \Sigma P_p &= 0 \\ \Sigma x P_p &= 0 \\ &\;\;\vdots \\ \Sigma x^{p-1} P_p &= 0 \end{aligned} \tag{III.29.15}$$

Substituting the values of the P's from III.29.13, it is found that

$$\begin{aligned} C_{p0}\,\mu_0 + C_{p1}\,\mu_1 + \ldots + C_{p,p-1}\,\mu_{p-1} + \mu_p &= 0 \\ C_{p0}\,\mu_1 + C_{p1}\,\mu_2 + \ldots + C_{p,p-1}\,\mu_p + \mu_{p+1} &= 0 \\ &\;\;\vdots \\ C_{p0}\,\mu_{p-1} + C_{p1}\,\mu_p + \ldots + C_{p,p-1}\,\mu_{2p-2} + \mu_{2p-1} &= 0 \end{aligned} \tag{III.29.16}$$

From these equations,

$$C_{pj} = \frac{\Delta_{pj}^{(p)}}{\Delta^{(p-1)}}, \tag{III.29.17}$$

where $\Delta^{(p-1)}$ has the same meaning as before and $\Delta_{pj}^{(p)}$ is the minor (taken with its proper sign) of the term in the last row and $(j + 1)$th column of $\Delta^{(p)}$. It follows that

$$P_p = \frac{1}{\Delta^{(p-1)}} \begin{vmatrix} \mu_0 & \mu_1 & \ldots & \mu_p \\ \mu_1 & \mu_2 & \ldots & \mu_{p+1} \\ \vdots & \vdots & & \vdots \\ \mu_{p-1} & \mu_p & \ldots & \mu_{2p-1} \\ 1 & X & \ldots & X^p \end{vmatrix} \tag{III.29.18}$$


It is clear, because of the diagonal symmetry of $\Delta^{(p)}$, that $C_{jk} = C_{kj}$.          III.29.19.
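From III.29.15,

$$\Sigma (P_p^2) = \Sigma (x^p P_p)$$

and hence from III.29.18, if we multiply the last row by $x^p$ and sum,

$$\Sigma (P_p^2) = \frac{n\,\Delta^{(p)}}{\Delta^{(p-1)}} \tag{III.29.20}$$

Likewise

$$\Sigma (y P_p) = \frac{n\,\Delta_p^{(p)}}{\Delta^{(p-1)}} \tag{III.29.21}$$

Finally, from III.29.9,

$$b_p = \frac{\Delta_p^{(p)}}{\Delta^{(p)}} \tag{III.29.22}$$

and the problem is completed.

Specifically, if $\mu_0 = 1$, $\mu_1 = 0$, $\mu_2 = 1$, then

$$P_0 = 1, \qquad P_1 = \begin{vmatrix} 1 & 0 \\ 1 & X \end{vmatrix} = X,$$

$$P_2 = \frac{\begin{vmatrix} 1 & 0 & 1 \\ 0 & 1 & \mu_3 \\ 1 & X & X^2 \end{vmatrix}}{\begin{vmatrix} 1 & 0 \\ 0 & 1 \end{vmatrix}} = X^2 - \mu_3 X - 1,$$

$$P_3 = \frac{\begin{vmatrix} 1 & 0 & 1 & \mu_3 \\ 0 & 1 & \mu_3 & \mu_4 \\ 1 & \mu_3 & \mu_4 & \mu_5 \\ 1 & X & X^2 & X^3 \end{vmatrix}}{\begin{vmatrix} 1 & 0 & 1 \\ 0 & 1 & \mu_3 \\ 1 & \mu_3 & \mu_4 \end{vmatrix}} \tag{III.29.23}$$

$$= \frac{1}{\mu_4 - \mu_3^2 - 1}\left\{(\mu_4 - \mu_3^2 - 1)\,X^3 - (\mu_5 - \mu_4\mu_3 - \mu_3)\,X^2 + (\mu_5\mu_3 - \mu_4^2 + \mu_4 - \mu_3^2)\,X + (\mu_5 - 2\mu_4\mu_3 + \mu_3^3)\right\}$$
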
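The special-case polynomials in III.29.23 can be checked numerically. The sketch below takes a small hypothetical sample, standardizes it so that $\mu_1 = 0$ and $\mu_2 = 1$, computes $\mu_3$, $\mu_4$, $\mu_5$ as averages of powers, and confirms that the sums of cross products of $P_0, P_1, P_2, P_3$ vanish. The data and the function names are assumptions of this example.

```python
import math

# Hypothetical sample, standardized below so that mu_1 = 0 and mu_2 = 1.
raw = [1.0, 2.0, 2.5, 4.0, 7.0, 8.5, 9.0, 12.0]
n = len(raw)
mean = sum(raw) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in raw) / n)
xs = [(v - mean) / sd for v in raw]

def moment(k):
    """k-th moment of the standardized values (average of x^k)."""
    return sum(x ** k for x in xs) / n

mu3, mu4, mu5 = moment(3), moment(4), moment(5)

def p0(x):
    return 1.0

def p1(x):
    return x

def p2(x):
    return x * x - mu3 * x - 1.0

def p3(x):
    lead = mu4 - mu3 ** 2 - 1.0
    return (lead * x ** 3
            - (mu5 - mu4 * mu3 - mu3) * x ** 2
            + (mu5 * mu3 - mu4 ** 2 + mu4 - mu3 ** 2) * x
            + (mu5 - 2.0 * mu4 * mu3 + mu3 ** 3)) / lead

polys = [p0, p1, p2, p3]
cross_sums = {(j, k): sum(polys[j](x) * polys[k](x) for x in xs)
              for j in range(4) for k in range(j)}
for pair, value in sorted(cross_sums.items()):
    print(pair, f"{value:.2e}")   # each sum is zero up to rounding error
```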

To illustrate: From Table III.4 the regression data are obtained and placed in the first three columns of Table III.6.

Table III. 6.

  (1)   (2)   (3)      (4)         (5)      (6)       (7)
  Yx     X    fx      fxYx       fxXYx      fxX      fxX²

 23.1    1    55    1270.5      1270.5       55        55
 27.0    3    75    2025.0      6075.0      225       675
 30.6    5    74    2264.4     11322.0      370      1850
 30.7    7    70    2149.0     15043.0      490      3430
 39.7    9    63    2501.1     22509.9      567      5103
 35.8   11    35    1253.0     13783.0      385      4235
 38.4   13    50    1920.0     24960.0      650      8450
 40.6   15    33    1339.8     20097.0      495      7425
 47.1   17    41    1931.1     32828.7      697     11849
 44.9   19    37    1661.3     31564.7      703     13357
 47.8   21    51    2437.8     51193.8     1071     22491
 55.4   23    63    3490.2     80274.6     1449     33327
 54.7   25    81    4430.7    110767.5     2025     50625
 51.0   27    45    2295.0     61965.0     1215     32805
 51.9   29   133    6902.7    200178.3     3857    111853
 55.4   31    93    5152.2    159718.2     2883     89373
 58.4   33   109    6365.6    210064.8     3597    118701
 55.9   35    86    4807.4    168259.0     3010    105350
 59.5   37    46    2737.0    101269.0     1702     62974
 61.0   39    49    2989.0    116571.0     1911     74529
 53.3   41    16     852.8     34964.8      656     26896
 79.1   43    11     870.1     37414.3      473     20339
 60.9   45     8     487.2     21924.0      360     16200
 68.5   47     6     411.0     19317.0      282     13254
 45.8   49     3     137.4      6732.6      147      7203
 48.5   51     2      97.0      4947.0      102      5202
 36.5   53     1      36.5      1934.5       53      2809

 Totals             62814.8   1566949.2    29430    850360

To obtain the various regression (trend) functions for the data of Table III.4, it is necessary to compute the following values, the obtainment of the first four being shown in columns (4), (5), (6), (7) of Table III.6:

ΣfxYx    =        62814.8
ΣfxXYx   =      1566949.2
ΣfxX     =        29430
ΣfxX²    =       850360
ΣfxYx²   =      2867513.03
ΣfxX³    =     27146214
ΣfxX⁴    =    917057464
ΣfxX⁵    =  32132903385
ΣfxX⁶    = 1180837278435
ΣfxX²Yx  =     47175422.8
ΣfxX³Yx  =   1535815847.1
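The column totals and several of the sums listed above can be recomputed directly from the first three columns of Table III.6. The sketch below does so; the list names and the helper `weighted_sum` are choices made for this illustration.

```python
# First three columns of Table III.6: speed X, mean spacing Yx, frequency fx.
speeds = list(range(1, 54, 2))     # 1, 3, 5, ..., 53
spacing = [23.1, 27.0, 30.6, 30.7, 39.7, 35.8, 38.4, 40.6, 47.1,
           44.9, 47.8, 55.4, 54.7, 51.0, 51.9, 55.4, 58.4, 55.9,
           59.5, 61.0, 53.3, 79.1, 60.9, 68.5, 45.8, 48.5, 36.5]
freq = [55, 75, 74, 70, 63, 35, 50, 33, 41,
        37, 51, 63, 81, 45, 133, 93, 109, 86,
        46, 49, 16, 11, 8, 6, 3, 2, 1]

def weighted_sum(x_power=0, y_power=0):
    """Sum of fx * X^x_power * Yx^y_power over the rows of the table."""
    return sum(f * x ** x_power * y ** y_power
               for x, y, f in zip(speeds, spacing, freq))

n = weighted_sum()               # total frequency
sum_fy = weighted_sum(0, 1)      # column (4)
sum_fxy = weighted_sum(1, 1)     # column (5)
sum_fx = weighted_sum(1)         # column (6)
sum_fx2 = weighted_sum(2)        # column (7)
sum_fx3 = weighted_sum(3)
sum_fx4 = weighted_sum(4)
sum_fx2y = weighted_sum(2, 1)

print(n, sum_fx, sum_fx2, sum_fx3, sum_fx4)
print(round(sum_fy, 1), round(sum_fxy, 1), round(sum_fx2y, 1))
```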

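First, it is necessary to compute the value of the $b_j$'s from III.29.22. These are found to be as follows:

$$b_0 = \frac{\Delta_0^{(0)}}{\Delta^{(0)}} = \frac{|\mu_{01}|}{|\mu_0|} = \frac{|\Sigma f_x Y_x|}{|n|} = \frac{62814.8}{1336} = 47.017 \tag{III.29.24}$$

$$b_1 = \frac{\Delta_1^{(1)}}{\Delta^{(1)}} = \frac{\begin{vmatrix} \mu_0 & \mu_{01} \\ \mu_1 & \mu_{11} \end{vmatrix}}{\begin{vmatrix} \mu_0 & \mu_1 \\ \mu_1 & \mu_2 \end{vmatrix}} = \frac{\begin{vmatrix} n & \Sigma f_x Y_x \\ \Sigma f_x X & \Sigma f_x X Y_x \end{vmatrix}}{\begin{vmatrix} n & \Sigma f_x X \\ \Sigma f_x X & \Sigma f_x X^2 \end{vmatrix}} = \frac{\begin{vmatrix} 1336 & 62814.8 \\ 29430 & 1566949.2 \end{vmatrix}}{\begin{vmatrix} 1336 & 29430 \\ 29430 & 850360 \end{vmatrix}}$$

$$= \frac{(1336)(1566949.2) - (29430)(62814.8)}{(1336)(850360) - (29430)(29430)} = \frac{244804567.2}{269956060} = 0.909 \tag{III.29.25}$$
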


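$$b_2 = \frac{\Delta_2^{(2)}}{\Delta^{(2)}} = \frac{\begin{vmatrix} \mu_0 & \mu_1 & \mu_{01} \\ \mu_1 & \mu_2 & \mu_{11} \\ \mu_2 & \mu_3 & \mu_{21} \end{vmatrix}}{\begin{vmatrix} \mu_0 & \mu_1 & \mu_2 \\ \mu_1 & \mu_2 & \mu_3 \\ \mu_2 & \mu_3 & \mu_4 \end{vmatrix}} = \frac{\begin{vmatrix} n & \Sigma f_x X & \Sigma f_x Y_x \\ \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x X Y_x \\ \Sigma f_x X^2 & \Sigma f_x X^3 & \Sigma f_x X^2 Y_x \end{vmatrix}}{\begin{vmatrix} n & \Sigma f_x X & \Sigma f_x X^2 \\ \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x X^3 \\ \Sigma f_x X^2 & \Sigma f_x X^3 & \Sigma f_x X^4 \end{vmatrix}}$$

$$= \frac{\begin{vmatrix} 1336 & 29430 & 62815 \\ 29430 & 850360 & 1566949 \\ 850360 & 27146214 & 47175423 \end{vmatrix}}{\begin{vmatrix} 1336 & 29430 & 850360 \\ 29430 & 850360 & 27146214 \\ 850360 & 27146214 & 917057464 \end{vmatrix}}$$

$$= \frac{1336 \begin{vmatrix} 850360 & 1566949 \\ 27146214 & 47175423 \end{vmatrix} - 29430 \begin{vmatrix} 29430 & 1566949 \\ 850360 & 47175423 \end{vmatrix} + 62815 \begin{vmatrix} 29430 & 850360 \\ 850360 & 27146214 \end{vmatrix}}{1336 \begin{vmatrix} 850360 & 27146214 \\ 27146214 & 917057464 \end{vmatrix} - 29430 \begin{vmatrix} 29430 & 27146214 \\ 850360 & 917057464 \end{vmatrix} + 850360 \begin{vmatrix} 29430 & 850360 \\ 850360 & 27146214 \end{vmatrix}}$$

$$= \frac{(1336)(-24206)(10^8) - (29430)(55901)(10^6) + (62815)(75801)(10^6)}{(1336)(42912)(10^9) - (29430)(39049)(10^8) + (850360)(75801)(10^6)}$$

$$= \frac{-1176482}{68673633} = -0.01713 \tag{III.29.26}$$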
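The determinant ratios for $b_0$, $b_1$ and $b_2$ can be evaluated mechanically. The sketch below is one way to do it, under the assumption that the sums quoted in the text are used as the moments; `det` and `b` are helper names invented for this example.

```python
def det(m):
    """Determinant by expansion along the first row (adequate for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * det(minor)
    return total

# Sums quoted in the text: S[k] is the sum of fx * X^k, SY[k] the sum of fx * X^k * Yx.
S = {0: 1336, 1: 29430, 2: 850360, 3: 27146214, 4: 917057464}
SY = {0: 62814.8, 1: 1566949.2, 2: 47175422.8}

def b(p):
    """b_p as the ratio of determinants in III.29.22."""
    delta = [[S[i + j] for j in range(p + 1)] for i in range(p + 1)]
    # Replace the last, (p + 1)th, column by the product moments.
    delta_p = [row[:p] + [SY[i]] for i, row in enumerate(delta)]
    return det(delta_p) / det(delta)

b0, b1, b2 = b(0), b(1), b(2)
print(f"b0 = {b0:.3f}, b1 = {b1:.4f}, b2 = {b2:.5f}")
```

Carried to full precision, the ratios come out near 47.017, 0.907 and -0.0171, so the second and third differ slightly from the figures 0.909 and -0.01713 quoted in the text.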


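$$b_3 = \frac{\Delta_3^{(3)}}{\Delta^{(3)}} = \frac{\begin{vmatrix} \mu_0 & \mu_1 & \mu_2 & \mu_{01} \\ \mu_1 & \mu_2 & \mu_3 & \mu_{11} \\ \mu_2 & \mu_3 & \mu_4 & \mu_{21} \\ \mu_3 & \mu_4 & \mu_5 & \mu_{31} \end{vmatrix}}{\begin{vmatrix} \mu_0 & \mu_1 & \mu_2 & \mu_3 \\ \mu_1 & \mu_2 & \mu_3 & \mu_4 \\ \mu_2 & \mu_3 & \mu_4 & \mu_5 \\ \mu_3 & \mu_4 & \mu_5 & \mu_6 \end{vmatrix}} = \frac{\begin{vmatrix} n & \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x Y_x \\ \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x X^3 & \Sigma f_x X Y_x \\ \Sigma f_x X^2 & \Sigma f_x X^3 & \Sigma f_x X^4 & \Sigma f_x X^2 Y_x \\ \Sigma f_x X^3 & \Sigma f_x X^4 & \Sigma f_x X^5 & \Sigma f_x X^3 Y_x \end{vmatrix}}{\begin{vmatrix} n & \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x X^3 \\ \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x X^3 & \Sigma f_x X^4 \\ \Sigma f_x X^2 & \Sigma f_x X^3 & \Sigma f_x X^4 & \Sigma f_x X^5 \\ \Sigma f_x X^3 & \Sigma f_x X^4 & \Sigma f_x X^5 & \Sigma f_x X^6 \end{vmatrix}} \tag{III.29.27}$$

Note: To evaluate determinants, the reader is referred to "A Textbook of Determinants, Matrices, and Algebraic Forms," by W. L. Ferrar, Oxford University Press, 1941.

Next, it is necessary to obtain the various orthogonal polynomials. They are

$$P_1 = \frac{\begin{vmatrix} n & \Sigma f_x X \\ 1 & X \end{vmatrix}}{|n|} = \frac{\begin{vmatrix} 1336 & 29430 \\ 1 & X \end{vmatrix}}{1336} = \frac{1336\,X - 29430}{1336} = X - 22.03 \tag{III.29.28}$$

$$P_2 = \frac{\begin{vmatrix} n & \Sigma f_x X & \Sigma f_x X^2 \\ \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x X^3 \\ 1 & X & X^2 \end{vmatrix}}{\begin{vmatrix} n & \Sigma f_x X \\ \Sigma f_x X & \Sigma f_x X^2 \end{vmatrix}} = \frac{\begin{vmatrix} \Sigma f_x X & \Sigma f_x X^2 \\ \Sigma f_x X^2 & \Sigma f_x X^3 \end{vmatrix} - X \begin{vmatrix} n & \Sigma f_x X^2 \\ \Sigma f_x X & \Sigma f_x X^3 \end{vmatrix} + X^2 \begin{vmatrix} n & \Sigma f_x X \\ \Sigma f_x X & \Sigma f_x X^2 \end{vmatrix}}{\begin{vmatrix} n & \Sigma f_x X \\ \Sigma f_x X & \Sigma f_x X^2 \end{vmatrix}}$$

$$= \frac{\begin{vmatrix} 29430 & 850360 \\ 850360 & 27146214 \end{vmatrix} - X \begin{vmatrix} 1336 & 850360 \\ 29430 & 27146214 \end{vmatrix} + X^2 \begin{vmatrix} 1336 & 29430 \\ 29430 & 850360 \end{vmatrix}}{\begin{vmatrix} 1336 & 29430 \\ 29430 & 850360 \end{vmatrix}}$$

$$= \frac{75800948420 - 11241247104\,X + 269956060\,X^2}{269956060} = 280.7899 - 41.6410\,X + X^2 \tag{III.29.29}$$
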


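The linear regression (trend) function is

$$Y_x = b_0 + b_1 P_1 = b_0 + b_1 (X - 22.03) = 47.017 + 0.909\,(X - 22.03) = 26.99 + 0.909\,X \tag{III.29.30}$$

which agrees with the result obtained in III.23.2, p. 115, as it should.

The quadratic regression (trend) function is

$$Y_x = b_0 + b_1 P_1 + b_2 P_2 = 47.017 + 0.909\,(X - 22.03) - 0.01713\,(280.7899 - 41.6410\,X + X^2)$$

$$= 22.18 + 1.622\,X - 0.01713\,X^2 \tag{III.29.31}$$

Likewise

$$P_3 = \frac{\begin{vmatrix} n & \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x X^3 \\ \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x X^3 & \Sigma f_x X^4 \\ \Sigma f_x X^2 & \Sigma f_x X^3 & \Sigma f_x X^4 & \Sigma f_x X^5 \\ 1 & X & X^2 & X^3 \end{vmatrix}}{\begin{vmatrix} n & \Sigma f_x X & \Sigma f_x X^2 \\ \Sigma f_x X & \Sigma f_x X^2 & \Sigma f_x X^3 \\ \Sigma f_x X^2 & \Sigma f_x X^3 & \Sigma f_x X^4 \end{vmatrix}} \tag{III.29.32}$$
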


Since the effect of adding the second degree term is rather small, it follows that the addition of the third degree term is negligible and redundant. In III.29.30 and III.29.31, $Y_x$ is the probable or expected minimum spacing for a particular speed X.

Suppose X = 10 miles per hour; then from III.29.30 we find that the expected minimum spacing is $Y_{10}$ = 36.08 feet, and from III.29.31 we find $Y_{10}$ = 36.69 feet.

Again, if X = 30 miles per hour, III.29.30 gives $Y_{30}$ = 54.26 feet and III.29.31 gives $Y_{30}$ = 55.42 feet.

If X = 50 miles per hour, III.29.30 gives $Y_{50}$ = 72.44 feet and III.29.31 gives $Y_{50}$ = 60.45 feet.
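The spacings quoted for 10, 30 and 50 miles per hour follow from direct substitution in III.29.30 and III.29.31. A short sketch, with function names chosen for this illustration, repeats the evaluation:

```python
def linear_trend(speed):
    """III.29.30: expected minimum spacing in feet at a speed in miles per hour."""
    return 26.99 + 0.909 * speed

def quadratic_trend(speed):
    """III.29.31: the same, with the second degree term included."""
    return 22.18 + 1.622 * speed - 0.01713 * speed ** 2

for speed in (10, 30, 50):
    print(f"{speed} mph: linear {linear_trend(speed):.2f} ft, "
          f"quadratic {quadratic_trend(speed):.2f} ft")
```

The two functions stay within about a foot of each other at 10 and 30 miles per hour and part company at 50, where the data are sparse.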

It is to be emphasized that because of the scarcity of data beyond a speed of 40 miles per hour, it is not possible or scientifically sound to use the regression functions to predict the minimum spacing beyond that speed. In any event, however, the use of the quadratic function, III.29.31, gives the better estimate of the minimum spacing in so far as we are able to use either theory. For the lower speeds, III.29.30 gives an underestimate and for the higher speeds an overestimate.

It  also  appears  very  likely  that  the  actual  minimum  spacing 
is  not  expressible  in  terms  of  a  single  regression  function.  In  other 
words,  it  appears  that  there  may  be  one  regression  function  for 
lower  speeds  and  a  different  one  for  higher  speeds. 

CHAPTER  IV 

SAMPLING  THEORY 

Reliability  and  Significance 

IV. 1. Objective. In this chapter it is proposed to show how to use the mathematical models of distribution that were developed in Chapter III as a basis for making inferences from a limited number of happenings that will apply to all such happenings. This process of reasoning from the particular to the general is known as inductive inference and in a broader sense is called sampling theory.

Inductive  inference  is  a  means  by  which  scientific  progress 
comes  about.  The  research  worker  obtains  data  through  planned 
experiments  or  through  the  observation  of  natural  happenings 
such  as  the  occurrence  of  accidents  at  certain  types  of  highway 
intersections.  From  the  data  obtained  he  infers  that  certain  things 
are  so.  But  it  is  well  known  that  exact  inductive  inference  is 
theoretically impossible. One of the functions of statistics is to provide techniques for making inferences and for measuring the degree of certainty of the inferences.

In order to make the idea of inference somewhat more concrete, let us suppose that we have observed the speeds of one hundred vehicles at a given location and have found that five were traveling over seventy miles per hour. We might estimate from this sample that five per cent of all vehicles travel over seventy miles per hour, but we would not be very sure as to the correctness of our estimate for we know that a different sample of this limited size would undoubtedly lead to a different estimate. At best the sample contains but partial information about the law of behavior of the total population of drivers. Population is used in its statistical sense, meaning a collection of results or objects. Summary numbers calculated from the sample accurately characterize the sample, but the important question is, how good are these same summary numbers when used as estimates of the characteristics of the population? What is the error committed by the use of sample characterizing numbers in place of the associated population characterizing numbers?

The  role  of  statistics  in  providing  a  measure  of  the  uncertainty 
of  inferences  from  samples  is  confined  to  sampling  errors.  It  must 
be  assumed  that  the  experimenter  has  guarded  against  accidents  in 
recording  the  data.  In  gathering  data  the  first  consideration  is  the 
obtaining  of  a  random  sample. 

IV. 2. Random Sampling: In order to demonstrate what is meant by random sampling let us suppose that we have a given population and that the attribute or attributes of the population to be measured are specified. The problem is to find a sampling method for the given population and the stochastic variable being measured that will yield a random or unbiased sample. The answer lies partly in theory and partly in techniques that have been proven in practice or may have to be devised to meet a given situation.

The first requirement is that there be no obvious connection between the method of selection and the properties being studied. The method and the properties must be independent in so far as our prior knowledge enables us to make them so.

To meet the second requirement that the sample be a random selection, we rely on our previous experience with a given method as well as our intuition to justify its use on new occasions. A very reliable method of drawing random samples consists of constructing a model of the population and sampling from the model.

Actually, randomness is largely a matter of intuition. The theory of probability considers the set of all possible different samples that may be drawn from a specified universe and enables us to derive their distribution law for any desired characterizing summary number. This theory requires that it be made certain that the sampling method will tend to yield all possible different samples with equal frequency. A method that does this is called a random method.
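One modern way to sample from a model of the population is to number the units and let a pseudo-random generator pick the numbers. The sketch below is an illustration only: the population of 500 numbered vehicles and the sample size of 25 are hypothetical, and the fixed seed is there simply to make the run repeatable.

```python
import random

# Model of the population: hypothetical vehicles numbered 1 to 500.
population = list(range(1, 501))

rng = random.Random(2024)             # seeded generator, for a repeatable illustration
sample = rng.sample(population, 25)   # drawn without replacement

print(sorted(sample))
```

`random.sample` draws without replacement and gives every possible set of 25 units the same chance of selection, which is the requirement stated above for a random method.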

IV. 3. Distribution of Sample Arithmetic Means. For the purpose of illustrating the law of the distribution of sample arithmetic means, let us suppose that we have a normal universe, and that from this universe, we draw a large number of samples all of the same size, $n$. If the samples are random and drawn independently, then the distribution of sample arithmetic means is also normal. Furthermore, the arithmetic mean of the distribution of sample arithmetic means is the true arithmetic mean of the universe and the standard deviation of the distribution of sample arithmetic means is the standard deviation of the universe divided by the square root of the size of the sample. Expressed symbolically: If $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots, \bar{X}_i, \ldots, \bar{X}_k$ are the sample arithmetic means and if $\bar{X}$ is the arithmetic mean of the universe from which the samples were drawn, then

$$\bar{X} = \frac{\Sigma \bar{X}_i}{k} \tag{IV.3.1}$$

If $\sigma$ is the standard deviation of the universe of measures and $s_{\bar{x}}$ is the standard deviation of the distribution of sample arithmetic means, then

$$s_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \tag{IV.3.2}$$

The value $s_{\bar{x}}$ is frequently called the standard error of the arithmetic mean. Actually it is the measure of reliability of the arithmetic mean and is in fact the expected error committed when a particular sample arithmetic mean is used in place of the true arithmetic mean of the universe. The smaller the expected error, the more reliable or the more precise is the sample arithmetic mean. The measure of reliability given by IV.3.2 is exact in theory but not usable in practice because the value of $\sigma$ depends upon the population, which is not known. Consequently it is necessary to obtain from the sample an unbiased estimate of the universe variance $\sigma^2$, indicated by the symbol $\hat{\sigma}^2$. This is equal to:

$$\hat{\sigma}^2 = \frac{n}{n - 1}\,s^2 \tag{IV.3.3}$$

where $s^2$ is the variance of the sample. Substituting this value $\hat{\sigma}^2$ for $\sigma^2$ in IV.3.2, we obtain

$$s_{\bar{x}} = \frac{s}{\sqrt{n - 1}} \tag{IV.3.4}$$

which is usable as the standard error of the arithmetic mean.
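The behavior described by IV.3.1 and IV.3.2 can be seen in a small simulation. The sketch below draws many samples from a normal universe and compares the mean and standard deviation of the sample means with the theoretical values. The universe mean of 40, standard deviation of 9, sample size of 25 and the 2000 repetitions are all hypothetical choices for this illustration.

```python
import random
import statistics

random.seed(1)              # fixed seed so the run is repeatable
MU, SIGMA = 40.0, 9.0       # hypothetical universe mean and standard deviation
N, K = 25, 2000             # sample size and number of samples drawn

means = []
for _ in range(K):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(means)   # compare with MU (IV.3.1)
sd_of_means = statistics.pstdev(means)    # compare with SIGMA / sqrt(N) (IV.3.2)
theoretical_sd = SIGMA / N ** 0.5

print(f"mean of sample means: {mean_of_means:.2f} (universe mean {MU})")
print(f"s.d. of sample means: {sd_of_means:.2f} (theory {theoretical_sd:.2f})")
```

With 2000 samples the simulated values should land close to 40 and 1.8, though they will not match exactly, since the simulation is itself a sample.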



It  is  to  be  noted  that  IV.3.3.  gives  an  estimate  of  universe 
variance. 

Using the data of Table II.1 it was found that the arithmetic mean was 38.2 miles per hour and the standard deviation, 8.9 miles per hour. In II.22, page 50, it was also found that the expected speed of 38.2 miles per hour was in error at most 23.3 per cent with a measure of confidence of 71 per cent. To find out how near the true value of the arithmetic mean our sample mean is, we substitute in IV.3.4 and find that

$$s_{\bar{x}} = \frac{s}{\sqrt{n - 1}} = \frac{8.9}{\sqrt{299}} = 0.52 \text{ miles per hour} \tag{IV.3.5}$$

which is the expected error in the sample arithmetic mean. In
other  words,  it  is  68.27  per  cent  certain  that  the  true  arithmetic 
mean  in  the  universe  has  a  value  between  38.2  —  0.5  =  37.7  and 
38.2  +  0.5  =  38.7  miles  per  hour.  (68.27  is  the  per  cent  of  area 
contained  within  one  standard  deviation  on  each  side  of  the 
mean).  In  this  case  the  maximum  expected  relative  error  is 
0.52/38.7  =  1.3  per  cent  with  68.27  per  cent  certainty.  In  like 
manner  it  is  95.45  per  cent  certain  that  the  maximum  relative 
error  does  not  exceed  2.6  per  cent  and  similarly  it  is  99.73  per  cent 
certain  that  the  error  does  not  exceed  3.9  per  cent.  The  conclusion 
then  is  that  the  sample  arithmetic  mean  is  fairly  reliable  (precise) 
but  as  found  before,  it  is  not  usable  as  a  typical  or  characterizing 
speed. 

IV. 4. Inference Concerning Population Mean. Let $\mu$ be the population mean and $\bar{X}$ the sample mean. It is desired to test the hypothesis: The sample whose mean is $\bar{X}$ could have come from a population with mean $\mu$. If this is so, how certain are we that it did? This question is answered by using the t-distribution where in this case

$$t = \frac{|\bar{X} - \mu|}{s_{\bar{x}}} \tag{IV.4.1}$$

For example: Could our sample with arithmetic mean of 38.2 miles per hour have come from a population whose arithmetic mean is 40 miles per hour? Substituting the values already found in IV.4.1, we have

$$t = \frac{|38.2 - 40.0|}{0.52} = 1.54$$

Making  use  of  the  t-table  in  "Statistical  Methods  for  Research 
Workers"5  with  in  this  case  n  —  1  =  299  degrees  of  freedom  it  is 
found  that  5  per  cent  of  the  time  the  difference  as  expressed  by  t 
would  be  at  least  1.97.  Only  one  degree  of  freedom  is  lost  because 
the  only  restriction  is  that  the  deviations  are  taken  from  the 
mean  of  the  sample.  However,  our  value  of  t  =  1.54  is  less  than 
1.97. Hence it is concluded that on the 5 per cent level of significance we have insufficient grounds to reject the hypothesis.
In  other  words,  if  the  hypothesis  is  rejected,  it  would  be  rejected 
when  it  is  true  slightly  more  than  5  per  cent  of  the  time.  This 
means  that  we  would  have  a  slightly  greater  than  5  per  cent 
risk  in  rejecting  the  hypothesis.  To  put  it  in  another  way  the  odds 
are a bit less than 95 to a bit more than 5 per cent in favor of rejection of the hypothesis. The level of significance and risk are
synonymous,  for  the  level  of  significance  is  the  probability  that 
the  hypothesis  is  true  and  its  complement  is  the  probability  that 
the  hypothesis  is  not  true. 

IV. 5. Confidence Limits. Since it is impossible to estimate or predict the true value exactly it is necessary to obtain two numbers between which the true value will fall. These two numbers are known as confidence limits. To obtain them, it is necessary first to determine the value of t associated with the relevant degrees of freedom (number of possible values the variable assumes minus number of rigorous conditions or constraints the values must obey) and a desirable probability level of significance.

The sample arithmetic mean may be greater or less than the population arithmetic mean. From IV.4.1, it was found that

$$t = \frac{|\bar{X} - \mu|}{s_{\bar{x}}}$$

It is not hard to see from this equation that $\pm t = (\bar{X} - \mu)/s_{\bar{x}}$, or

$$\mu = \bar{X} \pm t\,s_{\bar{x}} \tag{IV.5.1}$$

which gives the two values (confidence limits) between which the true (population) arithmetic mean will fall. These values are based upon the specific degrees of freedom and level of significance as demanded by the subjective problem. The limit of significance and the degree of reliability may be of any desired value.

To illustrate: Suppose we have a sample whose arithmetic mean is 52, whose standard deviation is 5 and whose size is 101. It is desired to find the confidence limits on a 5 per cent level.

Making use of the t-table with $(n - 1) = 100$ degrees of freedom and IV.5.1, it is found that

$$\mu = 52 \pm 1.98\,\frac{5}{\sqrt{100}} = 52 \pm 0.99$$

whence the two values of $\mu$ are 51.01 and 52.99.

This  means  that  it  is  95  per  cent  certain  that  the  true  arithmetic 
mean  of  the  universe  lies  between  51.01  and  52.99.  Again,  it  is 
95  per  cent  certain  that  if  we  take  the  arithmetic  mean  of  52  as  the 
value of the population (true) arithmetic mean the error committed will not exceed 0.99/52 = .019 = 1.90 per cent. If the
error  that  may  be  tolerated  (which  is  obtained  from  the  subjective 
material)  is  not  less  than  1.90  per  cent,  then  for  the  pertinent 
purpose  the  sample  arithmetic  mean  may  be  used  as  the  population 
arithmetic  mean.  Otherwise,  it  may  not  be  used. 
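The confidence limits of this illustration can be reproduced with a small helper. This is a sketch: the function name is invented here, and the value t = 1.98 is simply the tabled figure quoted in the text for 100 degrees of freedom at the 5 per cent level, not something the code looks up.

```python
import math

def confidence_limits(sample_mean, sample_sd, n, t):
    """Limits mean +/- t * s / sqrt(n - 1), following IV.5.1 with IV.3.4."""
    standard_error = sample_sd / math.sqrt(n - 1)
    half_width = t * standard_error
    return sample_mean - half_width, sample_mean + half_width

lower, upper = confidence_limits(sample_mean=52, sample_sd=5, n=101, t=1.98)
relative_error = (upper - 52) / 52

print(f"limits: {lower:.2f} to {upper:.2f}")
print(f"maximum relative error: {100 * relative_error:.2f} per cent")
```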

IV. 6. Difference Between Sample Arithmetic Means. Frequently the
arithmetic means are computed from two independent samples.
The question that needs to be answered is: Are these samples
independent and from the same normal universe? To answer this
question we again make use of the t-distribution, but in this case
we use for t the value t' given by

$$t' = \frac{|\bar{X}_1 - \bar{X}_2|}{\sqrt{\dfrac{(N_1 + N_2)(N_1 S_1^2 + N_2 S_2^2)}{(N_1 N_2)(N_1 + N_2 - 2)}}} \qquad \text{IV.6.1.}$$

where

$\bar{X}_1$ is the arithmetic mean of the first sample

$\bar{X}_2$ is the arithmetic mean of the second sample

$S_1^2$ is the variance of the first sample

$S_2^2$ is the variance of the second sample

$N_1$ is the size of the first sample

$N_2$ is the size of the second sample

$N_1 + N_2 - 2$ are the degrees of freedom and

$$\sqrt{\frac{(N_1 + N_2)(N_1 S_1^2 + N_2 S_2^2)}{(N_1 N_2)(N_1 + N_2 - 2)}}$$

is the standard deviation of the distribution of differences
between independent sample arithmetic means from the same
normal universe.

To illustrate: Suppose we have the following two samples:

                          Sample I              Sample II

Arithmetic mean           $\bar{X}_1 = 145$     $\bar{X}_2 = 150$
Standard deviation        $S_1 = 5$             $S_2 = 6$
Number of individuals     $N_1 = 12$            $N_2 = 20$

We wish to test the hypothesis: The difference between the
sample arithmetic means is insignificant, therefore, these two
samples are independent and from the same normal universe.

To make the test we use IV.6.1. Substituting the given values
in IV.6.1., it is found that in numerical value

$$t' = \frac{|145 - 150|}{\sqrt{\dfrac{32\,[12\,(25) + 20\,(36)]}{240\,(30)}}} = \frac{5}{\sqrt{4.53}} = \frac{5}{2.13} = 2.35$$

Making use of the t-table with $(N_1 + N_2 - 2)$ = (12 + 20 - 2)
= 30 degrees of freedom it is found that, if the two samples came
from the same normal universe, the probability of a value as large
as t = 2.042 is 0.05 and the probability of a value as large as
t = 2.750 is 0.01. The value of
t = 2.35 lies between the 5 per cent and 1 per cent levels of
significance, hence, we conclude that the two sample arithmetic means
are significantly different on the 5 per cent level but not so on
the 1 per cent level. This means that the odds are between 95 and
99 to between 5 and 1 in favor of rejecting the hypothesis that
the two samples came from the same normal universe.
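
The substitution in IV.6.1. can be checked with the following sketch. The function name `t_prime` is an assumption made for illustration; the critical values 2.042 and 2.750 are the tabled figures quoted in the text for 30 degrees of freedom, not values computed by the code.

```python
import math

def t_prime(mean1, var1, n1, mean2, var2, n2):
    """Return t' of IV.6.1 and its degrees of freedom."""
    spread = ((n1 + n2) * (n1 * var1 + n2 * var2)) / ((n1 * n2) * (n1 + n2 - 2))
    return abs(mean1 - mean2) / math.sqrt(spread), n1 + n2 - 2

# Sample I: mean 145, S = 5, N = 12.  Sample II: mean 150, S = 6, N = 20.
t_value, dof = t_prime(145, 5 ** 2, 12, 150, 6 ** 2, 20)

# Tabled values quoted in the text for 30 degrees of freedom.
significant_at_5 = t_value > 2.042
significant_at_1 = t_value > 2.750
print(round(t_value, 2), dof, significant_at_5, significant_at_1)
```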

It is important to note that if the two means had not been
significantly different it would have been necessary to investigate
the significance of the difference between the variances. The
method of doing this will be shown later.

If the variances or the means, or both, are significantly different,
we have grounds to reject the hypothesis; but if the variances and
means each are not significantly different, we do not have grounds
to reject the hypothesis. This is true because the normal
distribution is a two-parameter family of curves.

IV. 7. Size of Sample for Arithmetic Mean. Suppose we require,
within a specified degree of certainty, that the sample arithmetic
mean shall differ from the true mean by not more than a given
$\epsilon$. Consider again

$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}} \qquad \text{IV.7.1.}$$

Since the error is $\epsilon$, it follows that $\bar{X} - \mu = \epsilon$.
Hence IV.7.1. becomes

$$t = \frac{\epsilon}{s_{\bar{X}}} = \frac{\epsilon}{s/\sqrt{N - 1}} \qquad \text{IV.7.2.}$$

Rewriting IV.7.2., we obtain

$$\frac{N - 1}{t^2} = \frac{s^2}{\epsilon^2} \qquad \text{IV.7.3.}$$

Suppose we wish to know the size of the sample such that it is
95 per cent certain that the sample mean is within 2 units of the
true mean of the universe. In this case, if the variance of the
sample is 100, $s^2 = 100$, $\epsilon^2 = 4$ and from IV.7.3.,

$$\frac{N - 1}{t^2} = \frac{100}{4} = 25$$

From the t-table, it is found that when N = 101,
$\frac{N - 1}{t^2} = 25.508$ and when N = 91, $\frac{N - 1}{t^2} = 22.727$.
Since the required value of 25 lies between these two, the larger
size is taken. Hence, the size of the sample is 101.
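
The table search just described can be sketched as follows. The names `size_ratio` and `t_table` are assumptions for illustration. The value t = 1.98 for 100 degrees of freedom appears in the text; t = 1.99 for 90 degrees of freedom is an assumed table reading, chosen because it reproduces the quoted figure 22.727.

```python
def size_ratio(n, t_value):
    """Left-hand side of IV.7.3: (N - 1) / t^2."""
    return (n - 1) / t_value ** 2

required = 100 / 4  # s^2 / epsilon^2 from the illustration

# 5 per cent t-values keyed by sample size N (degrees of freedom N - 1).
# 1.98 is quoted in the text; 1.99 is an assumed table reading.
t_table = {91: 1.99, 101: 1.98}

adequate = [n for n, t in sorted(t_table.items()) if size_ratio(n, t) >= required]
print(round(size_ratio(101, 1.98), 3), round(size_ratio(91, 1.99), 3), adequate[0])
```

With a finer table the same search would stop at the first N whose ratio reaches 25, which may be somewhat smaller than 101.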
IV. 8. Reliability of Sample Standard Deviation. The test for the
reliability of a sample standard deviation is defined as $\chi^2$
(Chi-square) and is

$$\chi^2 = \frac{N S^2}{\sigma^2} \qquad \text{IV.8.1.}$$

where N is the size of the sample, $S^2$ is the sample variance and
$\sigma^2$ is the population variance. Thus $\chi^2$ is the sum of the
squares of N - 1 independent normal deviates divided by their common
variance.

This criterion is useful for comparing a sample variance with a
population variance.

To illustrate: Take a sample of size 10 whose variance is 25.
Could this sample have come from a universe whose variance is 16?

Using IV.8.1., it is found that

$$\chi^2 = \frac{10\,(25)}{16} = \frac{250}{16} = 15.63$$

From a $\chi^2$ table for (N - 1) = 9 degrees of freedom, it is found
that the probability of $\chi^2 > 14.684$ is 0.10 and the probability of
$\chi^2 > 16.919$ is 0.05.

It follows that a population (universe) having a variance of
16 could yield a sample with variance of 25 or more between 5 and
10 times out of 100.

Sometimes it is desirable to obtain from the sample an unbiased
estimate of the true universe variance. This is accomplished by
using

$$\sigma^2 = \frac{N}{N - 1}\,S^2 \qquad \text{IV.8.2.}$$

which in this case becomes

$$\sigma^2 = \frac{10}{9}\,(25) = 27.8$$

which means that the expected value of the universe variance is
27.8 when the sample variance is 25 and the size of the sample is 10.
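
Both computations of this section, IV.8.1. and IV.8.2., are reproduced in the sketch below. The function names are assumptions for illustration, and the limits 14.684 and 16.919 are the tabled values quoted in the text for 9 degrees of freedom.

```python
def chi_square_for_variance(n, sample_var, population_var):
    """IV.8.1: N * S^2 / sigma^2, with N - 1 degrees of freedom."""
    return n * sample_var / population_var

def unbiased_variance(n, sample_var):
    """IV.8.2: N / (N - 1) * S^2."""
    return n / (n - 1) * sample_var

chi2 = chi_square_for_variance(10, 25, 16)
estimate = unbiased_variance(10, 25)

# Tabled values quoted in the text for 9 degrees of freedom.
between_5_and_10_per_cent = 14.684 < chi2 < 16.919
print(chi2, round(estimate, 1), between_5_and_10_per_cent)
```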
IV. 9. Significance of Difference Between Sample Variances. The
test here is to determine, with respect to variance, whether two
samples are independent and from the same normal universe.
The criterion is the F-test which is given by

$$F = \frac{{S'_1}^2}{{S'_2}^2} \qquad \text{IV.9.1.}$$

where ${S'_1}^2 = \frac{N_1 S_1^2}{N_1 - 1}$ and
${S'_2}^2 = \frac{N_2 S_2^2}{N_2 - 1}$, and the degrees of freedom
for ${S'_1}^2$ is $N_1 - 1$ and for ${S'_2}^2$ is $N_2 - 1$. Having two
unbiased estimates of variance, always use for ${S'_1}^2$ the greater
of the two variances.

To illustrate: Let there be given two samples of 10 and 12
individuals respectively. Let their variances be 10 and 5 respectively.
Are these two samples independent and from the same normal
universe? In other words, is the variance 10 significantly greater
than the variance 5?

Substituting in IV.9.1., it is found that F becomes

$$F = \frac{N_1 S_1^2/(N_1 - 1)}{N_2 S_2^2/(N_2 - 1)} = \frac{10\,(10)/9}{12\,(5)/11} = 2.04$$

From the F-table with $n_1 = N_1 - 1 = 9$ degrees of freedom and
$n_2 = N_2 - 1 = 11$ degrees of freedom, we find that at the 5 per
cent level of significance F is 2.90 and at the 1 per cent level
of significance F is 4.63.

Hence we conclude that, since our value of F is 2.04 which is less
than the F for the 5 per cent level, the larger variance is not
significantly greater than the smaller. In other words, there are not
sufficient grounds to reject the hypothesis that the two samples
could have come from the same normal universe.
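
A brief sketch of the F computation follows. The helper names are assumptions for illustration; the code places the larger unbiased estimate in the numerator, as the text directs, and compares the result with the tabled 5 per cent value of 2.90 quoted above.

```python
def unbiased_estimate(n, sample_variance):
    """N * S^2 / (N - 1), the unbiased estimate used in IV.9.1."""
    return n * sample_variance / (n - 1)

def f_ratio(n1, var1, n2, var2):
    """Return F and its two degrees of freedom, larger estimate on top."""
    first = (unbiased_estimate(n1, var1), n1 - 1)
    second = (unbiased_estimate(n2, var2), n2 - 1)
    larger, smaller = (first, second) if first[0] >= second[0] else (second, first)
    return larger[0] / smaller[0], larger[1], smaller[1]

f_value, dof_top, dof_bottom = f_ratio(10, 10, 12, 5)
significant_at_5 = f_value > 2.90  # tabled value quoted in the text
print(round(f_value, 2), dof_top, dof_bottom, significant_at_5)
```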

IV. 10. Significance of a Correlation Coefficient. The question here is:
Could the sample whose coefficient of correlation is r have come
from a non-correlated universe? We use

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}} \qquad \text{IV.10.1.}$$

where the degrees of freedom are N - 2.

To illustrate: Suppose we have a sample of size 11 whose
coefficient of correlation is 0.60. Could this sample have come from a
non-correlated universe?

Substitute these values in IV.10.1., and we obtain

$$t = \frac{0.60\sqrt{11 - 2}}{\sqrt{1 - .36}} = \frac{1.80}{0.80} = 2.25$$

From the t-table with 9 degrees of freedom we find that at the 5
per cent level of significance t = 2.262 and at the 1 per cent level
of significance t = 3.250. Hence we conclude that a little more
than 5 per cent of the time the sample could have come from a
non-correlated universe and a little less than 95 per cent of the
time, it could not. In other words, the odds are about 95 to 5 in
favor of rejecting the hypothesis that the sample could have come
from a non-correlated universe.
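
The test of IV.10.1. reduces to a single line of arithmetic, sketched here. The function name `correlation_t` is an assumption, and 2.262 is the tabled 5 per cent value quoted in the text for 9 degrees of freedom.

```python
import math

def correlation_t(r, n):
    """Return t of IV.10.1 and its degrees of freedom, N - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2), n - 2

t_value, dof = correlation_t(0.60, 11)
reaches_5_per_cent_level = t_value > 2.262  # tabled value quoted in the text
print(round(t_value, 2), dof, reaches_5_per_cent_level)
```

The computed t falls just short of the 5 per cent value, which is why the text speaks of "a little more than 5 per cent of the time."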

In the case of a multiple correlation coefficient, if we wish to
test whether the sample came from a non-correlated universe, the
criterion is

$$F = \frac{r_{1.23\ldots n}^2/(m - 1)}{(1 - r_{1.23\ldots n}^2)/(N - m)} \qquad \text{IV.10.2.}$$

where m is the number of parameters in the regression function, N
is the size of the sample and $n_1 = m - 1$, $n_2 = N - m$ are the
respective degrees of freedom.

To illustrate: Assume that $r_{1.23} = 0.60$ and that the regression
function is a plane, that is, m = 3, and that the size of the sample
is 103.

Substituting in IV.10.2., we have

$$F = \frac{.36/2}{.64/100} = 28.1$$

From the F-table we find that at the 5 per cent level, F = 3.09 and at
the 1 per cent level F = 4.82 when $n_1 = m - 1 = 2$ and $n_2 = N - m
= 100$. Hence we conclude that there are ample grounds to reject
the hypothesis that the sample came from a non-correlated universe.
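
The substitution in IV.10.2. can be verified with the sketch below. The function name `multiple_correlation_f` is an assumption for illustration; the value 4.82 is the tabled 1 per cent figure quoted in the text.

```python
def multiple_correlation_f(r, n, m):
    """Return F of IV.10.2 with degrees of freedom (m - 1, N - m)."""
    numerator = r ** 2 / (m - 1)
    denominator = (1 - r ** 2) / (n - m)
    return numerator / denominator, m - 1, n - m

f_value, dof_top, dof_bottom = multiple_correlation_f(0.60, 103, 3)
exceeds_1_per_cent_value = f_value > 4.82  # tabled value quoted in the text
print(f"{f_value:.3f}", dof_top, dof_bottom, exceeds_1_per_cent_value)
```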
To test the hypothesis concerning a partial correlation
coefficient the procedure is the same as that for a simple correlation
coefficient with the exception that the number of variables held
constant must be subtracted from the size of the sample N.
Hence, if k variables are held constant the test is

$$F = \frac{r_{12.34\ldots n}^2/1}{(1 - r_{12.34\ldots n}^2)/(N - k - 2)} \qquad \text{IV.10.3.}$$


REFERENCES, CHAPTER IV

1 Yule, G. Udny, and Kendall, M. G., "An Introduction to the Theory
of Statistics," C. Griffin & Co., London, 1937.

2 Croxton, F. E., and Cowden, D. J., "Applied General Statistics,"
Prentice-Hall Inc., New York, 1946.

3 Rider, Paul, "Statistical Methods," John Wiley & Sons Inc., New York,
1939.

4 Kendall, M. G., "The Advanced Theory of Statistics," Charles Griffin
& Co., London, 1946, Vol. I, page 40.

5 Fisher, R. A., "Statistical Methods for Research Workers," Oliver and
Boyd, Ltd., Edinburgh.


CHAPTER  V 

SOME  APPLICATIONS  OF 
STATISTICAL  METHODS 


V.  1.  Objective.  This  chapter  illustrates  some  of  the  applications  of 
statistical  methods  to  problems  of  most  interest  to  traffic  engineers. 
Usually  a  statistical  approach  is  more  rational  than  any  other  and  leads 
to  a  better  understanding  of  the  factors  involved.  The  methods 
apply  to  all  types  of  traffic  problems,  but  first  we  shall  study  those 
that  have  to  do  with  highway  capacity.  These  problems  are  of 
primary  concern,  for  they  are  connected  with  the  main  purpose 
of  a  highway  which  is  to  serve  traffic. 

V. 2. Confusion As to Meaning of Highway Capacity. Before
attempting any analysis, it is necessary that certain terms be defined.
There is some confusion as to what is meant by highway capacity. This
is brought out by the Highway Capacity Manual¹, which states
that the term perhaps most widely misunderstood and improperly
used in the field of highway capacity is the word capacity
itself. Considerable work went into the preparation of this manual,
and it offers the most authentic and complete information extant
on capacity. In Part I, Definitions, is found the statement that
"the term capacity without modification, is simply a generic
expression pertaining to the ability of a roadway to accommodate
traffic." The manual gives three levels of capacity:

1. Basic Capacity: "The maximum number of passenger cars
that can pass a given point on a lane or roadway during one
hour under the most nearly ideal roadway and traffic
conditions which can be attained."

2. Possible Capacity: "The maximum number of vehicles that
can pass a given point on a lane or roadway during one hour
under the prevailing roadway and traffic conditions."

3. Practical Capacity: "The maximum number of vehicles that
can pass a given point on a roadway or in a designated lane
during one hour without the traffic density being so great as
to cause unreasonable delay, hazard, or restriction to the
drivers' freedom to maneuver under the prevailing roadway and
traffic conditions."

Prevailing roadway conditions include roadway alignment,
number and width of lanes.

From  a  practical  standpoint,  speed  should  be  included  in  any 
definition  of  traffic  capacity.  The  driver  is  interested  primarily  in 
the  amount  of  time  it  takes  him  to  arrive  at  his  destination.  Perhaps 
capacity,  meaning  vehicles  per  hour,  should  be  supplemented  by  a 
dimensionless  index  number  similar  to  the  Reynolds  number  in 
hydraulics.  This  number  would  indicate  critical  limits. 

Since the term capacity has a variable meaning, we shall in most
cases use the word volume and define it as the number of vehicles
passing a given point per unit of time. Density will refer to the
number of vehicles in a given length of lane. With these definitions,

Average Volume = Average Density times Average Speed.

V.  3.  Theoretical  Maximum  Capacity  (Volume).  The  amount  of 
traffic  per  unit  of  time  depends  on  the  speed  and  the  spacing 
between  vehicles.  The  greater  the  speed  the  larger  is  the  volume, 
and  the  greater  the  spacing  the  less  is  the  volume.  Therefore, 

$$\text{Volume} = \frac{\text{Speed}}{\text{Spacing}}$$

This same reasoning applies to any number of lanes in the same
direction, but with more than one lane, passing takes place, which
adds another factor to be considered. For the sake of simplicity, we
shall first take up the theoretical capacity of a single lane.

In general, anyone who has observed traffic knows that as
speeds increase, the spacing between vehicles increases. If the
spacing increases at a greater rate than the speed, then there is an
optimum speed that gives a maximum volume. If the spacing
increases at a rate equal to or less than the speed, then the higher
the speed the greater the volume. The question of minimum
spacing needs to be examined critically.

The original assumption was that drivers should and did
maintain a safe stopping distance behind the vehicle ahead. This safe
stopping distance was based on the possibility that the car ahead
might stop instantaneously. This, of course, practically never
happens for it can take place only through some unusual occurrence
such as the head-on collision of two vehicles. That the original
assumption of minimum spacing persists is evidenced by an article
in Traffic Engineering for August, 1950, by Dr. Victor F. Hess,
Physics Department, Fordham University, New York.² It should
be mentioned that Dr. Hess is deriving a formula for safe travel
at a maximum efficiency. This article states accurately that the
stopping distance includes (1) a, the distance the vehicle travels
during the "reaction time" (time interval between the stop signal
observed and the instant the brakes are applied) and (2) b, the
distance the vehicle travels after the brakes are applied. The
distance a is proportional to the speed of the car v.

$$a = tv$$

Distance b, the braking distance, is the distance required to
absorb the kinetic energy of the vehicle ($\tfrac{1}{2}mv^2$), and therefore
must vary with the square of the velocity; that is

$$b = kv^2$$

in which the constant k is a factor depending upon the efficiency
of the brakes and the coefficient of friction between the tires and
the pavement. The stopping distance is equal to

$$a + b = tv + kv^2$$

in which t = reaction time, which is usually taken as .75 second.
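
The two components of the stopping distance may be sketched as below. The text gives no numerical value for k, so `K_HYPOTHETICAL` is an assumed figure chosen only to make the example run; the function name and the choice of feet and seconds as units are likewise assumptions.

```python
REACTION_TIME_S = 0.75  # "usually taken as .75 second" in the text

def stopping_distance(speed_fps, k, reaction_time=REACTION_TIME_S):
    """Return (a, b, a + b) in feet for a speed in feet per second."""
    reaction_distance = reaction_time * speed_fps  # a = t v
    braking_distance = k * speed_fps ** 2          # b = k v^2
    return reaction_distance, braking_distance, reaction_distance + braking_distance

K_HYPOTHETICAL = 0.03  # assumed value, sec^2 per foot, for illustration only

for speed in (22.0, 44.0, 88.0):
    a, b, total = stopping_distance(speed, K_HYPOTHETICAL)
    print(speed, round(a, 1), round(b, 1), round(total, 1))
```

Whatever value of k is used, doubling the speed doubles a and quadruples b, which is the property the later comparison with observed spacings depends upon.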

V. 4. Stopping Distance And Minimum Spacing. Observations
have proved that the stopping distance is not the minimum
spacing between vehicles. This fact may also be arrived at by inductive
reasoning.

If we assume that two vehicles are mechanically equivalent and
traveling at the same speed, then one can be stopped in the same
distance as the other, and if they both start to stop at the same
instant, they will come to rest at the same distance apart as when
the brakes were applied. The fact that the brakes cannot be
applied at the same time results from the rear driver's needing
time to react. What takes place is that the driver sees the car
ahead start to stop and then reacts and applies his brakes. This
reasoning leads to the conclusion that the minimum spacing
between vehicles consists of the distance required for reaction plus an
additional distance which the driver maintains as a safety factor.
This factor of safety distance may be quite small.

From  photographic  observations  of  vehicles  traveling  in  queues 
so  that  each  one  could  be  assumed  to  be  traveling  at  minimum 
spacing,  it  was  found  that  the  average  minimum  spacing  in  feet 
was approximately s = 1.1v + 21, in which v = speed in miles
per hour.*³ The factor 1.1 corresponds to the reaction time of
.75  seconds  if  the  speed  is  given  in  feet  per  second.  The  21  feet  is 
the  spacing  when  v  =  0,  and  includes  the  length  of  the  vehicle. 
This  factor  was  determined  in  1933,  for  a  given  composition  of 
traffic  and  would  evidently  not  apply  in  all  conditions.  It  may  be 
noted  that  if  the  spacing  is  expressed  in  time,  it  tends  to  be  a 
constant.  At  20  m.p.h.  the  time  spacing  would  be  1.46  seconds;  at 
30  m.p.h.,  1.2  seconds;  and  at  40  m.p.h.,  1.1  seconds. 
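
The time spacings quoted above follow from dividing the spacing in feet by the speed in feet per second, as the sketch below shows. The function names are assumptions; the conversion 5280/3600 feet per second per mile per hour is the only figure not taken from the text.

```python
FEET_PER_SECOND_PER_MPH = 5280 / 3600

def min_spacing_ft(speed_mph):
    """Observed average minimum spacing, s = 1.1 v + 21, in feet."""
    return 1.1 * speed_mph + 21

def time_spacing_s(speed_mph):
    """Minimum spacing expressed in seconds at the given speed."""
    return min_spacing_ft(speed_mph) / (speed_mph * FEET_PER_SECOND_PER_MPH)

# A reaction time of .75 second covers about 1.1 feet for each mile per hour.
feet_per_mph_in_reaction_time = 0.75 * FEET_PER_SECOND_PER_MPH

for mph in (20, 30, 40):
    print(mph, round(time_spacing_s(mph), 2))
```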

Observations  in  urban  traffic  have  shown  that  the  average 
minimum  spacing  between  vehicles  expressed  in  time  is  practically 
a  constant,  regardless  of  speed.  In  one  case,  it  was  found  to  be 
1.1 seconds for all speeds, which were low.⁴

In Part 3 of the Capacity Manual, Figure 1 shows the minimum
spacings given in the table below. These spacings, if we assume a
reaction time of .75 seconds, may be divided into a
reaction-judgment distance plus a braking distance.

Table V.1

Speed      Observed   Reaction      Additional  Ratio of       Ratio of
(m.p.h.)   Minimum    Distance,     Braking     Braking        Speeds
           Spacing    .75 Seconds   Distance    Distances      Squared
           (ft.)      (ft.)         (ft.)

10          44        11            33          33/38 = .87    10²/20² = 0.25
20          60        22            38          38/47 = .81    20²/30² = 0.44
30          80        33            47          47/64 = .73    30²/40² = 0.56
40         108        44            64          64/85 = .75    40²/50² = 0.64
50         140        55            85

* Compare with the formula s = 0.909 v (III. 23.2) which was based on
data which did not include zero speeds.
The braking distances for stopping should be proportional to the
square of the speeds, but as shown in the table, the additional
distances contained in the minimum spacings are not proportional
to this amount. This is additional
evidence that minimum spacings do not depend on braking ability.
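
The comparison made in Table V.1 can be recomputed from the observed spacings alone. In the sketch below the dictionary holds the spacings listed in the table; the variable and function names are assumptions, and the reaction distance is rounded to the nearest foot as the table appears to do.

```python
observed_spacing_ft = {10: 44, 20: 60, 30: 80, 40: 108, 50: 140}
REACTION_TIME_S = 0.75
FEET_PER_SECOND_PER_MPH = 5280 / 3600

def additional_distance(speed_mph):
    """Observed spacing less the distance covered in the reaction time."""
    reaction = round(REACTION_TIME_S * speed_mph * FEET_PER_SECOND_PER_MPH)
    return observed_spacing_ft[speed_mph] - reaction

speeds = sorted(observed_spacing_ft)
for low, high in zip(speeds, speeds[1:]):
    distance_ratio = additional_distance(low) / additional_distance(high)
    squared_speed_ratio = low ** 2 / high ** 2
    print(low, high, round(distance_ratio, 2), round(squared_speed_ratio, 2))
```

Were the additional distances true braking distances, the two ratios printed on each line would agree; they plainly do not.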

V.  5.  Interpretation  of  Minimum  Spacing  Formula.  The  formula 
s  =  1.1  v  +  21  would  give  a  maximum  traffic  flow  of  about  4000 
vehicles  per  hour  per  lane.  This,  of  course,  is  never  realized  except 
momentarily.  If  a  stream  of  traffic  were  moving  at  this  minimum 
spacing,  the  slowing  or  stopping  of  any  vehicle  would  immediately 
affect  all  following  vehicles.  The  formula  is  not  given  because  of 
its  practicability  but  because  it  points  to  two  significant  facts. 

a. The volume increases with speed, but apparently approaches
a maximum point at about 40 miles per hour where the
constant 21 ceases to be significant.

b. The minimum spacing depends primarily on
"reaction-perception-judgment" time.

V. 6.  Limiting  Factors.  To  summarize:  The  factors  that  limit  the 
capacity  of  a  highway,  are : 

1.  Necessary  minimum  clearance  between  vehicles. 

2. Slow moving vehicles that retard others, when passing is not
possible, due to lack of space on the opposite lane or to
restricted sight distance.

3.  Reduced  overall  speeds  caused  by  the  physical  features  of  the 
highway,  the  mechanical  characteristics  of  vehicles,  or  the 
desire  of  drivers. 

These  factors  need  to  be  studied  in  as  much  detail  as  possible  if  we 
are  to  reach  a  clear  conception  of  the  problem  of  measuring  the 
ability  of  a  highway  to  accommodate  traffic. 

V. 7. Additional Relationships of Spacing and Speed. In a study
made in Ohio in 1934,⁴ it was found that there is a straight line
relationship between average density in vehicles per mile (spacing)
and average speed. As the density increases, the speed decreases.

Expressed in the form of an equation

$$\frac{\text{Speed}}{\text{Density}} = k$$

where k is a constant for a given roadway and composition of
traffic. If this relationship is true, and it was based on observations
of over 220 groups of 100 vehicles each, it means that with a given
highway and composition of traffic the potential capacity range
can be obtained by getting the speeds at a low density and at
a high density since two points determine a straight line.

That the relationship $\frac{\text{Speed}}{\text{Density}} = k$ may be only
approximately true is indicated by information given in Figure 5,
page 31, of the Highway Capacity Manual.

This figure indicates that there is a straight-line relationship
between speed and volume of vehicles per hour. The equation of
the curve for "the majority of existing highways" as nearly as
may be judged from the Figure, is

$$S = 43 - .009\,V,$$

where S equals speed in miles per hour and V equals volume.
Figure V.2
Average Speed of all Vehicles on Level, Tangent Sections of 2-Lane Rural Highways
(Figure 5, page 31, "Highway Capacity Manual", Used by Permission of Bureau of Public
Roads, U.S. Department of Commerce.)

Letting D = density in vehicles per mile of roadway, V = D · S,
so that

$$S = 43 - .009\,V = 43 - .009\,D \cdot S$$

or

$$S = \frac{43}{1 + .009\,D}$$
By plotting speed against density Figure V.3. is obtained. The
graph has very little curvature, being nearly a straight line. Hence
for practical purposes it may be assumed with slight error that
speed varies directly (i.e. linearly) with density. It appears that
this may be as nearly correct as the assumption that speed varies
directly and linearly with volume.
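
The speed-density curve derived from the Manual's line may be tabulated with the sketch below, which also confirms that each point still satisfies S = 43 - .009 V. The function name and the particular densities chosen are assumptions for illustration.

```python
def speed_from_density(density):
    """S = 43 / (1 + .009 D), speed in m.p.h. for D vehicles per mile."""
    return 43 / (1 + 0.009 * density)

rows = []
for density in (0, 20, 40, 60, 80):
    speed = speed_from_density(density)
    volume = density * speed              # V = D * S
    check = 43 - 0.009 * volume           # should reproduce the speed
    rows.append((density, speed, volume, check))
    print(density, round(speed, 1), round(volume), round(check, 1))
```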
Figure V.3
Average Speed of all Vehicles on Level, Tangent Sections
of the Majority of Existing 2-Lane Main Rural Highways

Returning to the 1934⁴ report it will be noted that in Figure V.1.
(taken from page 468 of the report) the point that is marked "free
speed" indicates that practically no drop in speed on the two-lane
roadway was observed until the volume reached about 400
vehicles per hour. The figures near the curve show the number of
groups of 100 vehicles each for which the point marked is the
weighted average. The maximum possible volume was not
observed directly, but was obtained by assuming that the curve was
a straight line. The "free speed" for the curve shown was 43.8
m.p.h. This point is indicated to be about ten units to the right
since no noticeable speed drop was observed until the volume
reached about 400 vehicles per hour. The maximum possible
volume would come at the mid-point of the curve and would equal
$\frac{46}{2} \times \frac{195}{2} = 2{,}300$ (approx.) vehicles per hour. That the
mid-point of the curve gives the maximum volume is easily proved.

Let $S_m$ = maximum speed and $D_m$ = maximum density, then

$$\text{Slope of curve} = -\frac{S_m}{D_m}$$

Let x = varying values of D, then

$$V = \left(S_m - x\,\frac{S_m}{D_m}\right) x = S_m x - x^2\,\frac{S_m}{D_m}$$

Differentiating with respect to x

$$\frac{dV}{dx} = S_m - 2x\,\frac{S_m}{D_m}$$

For maximum volume

$$S_m - 2x\,\frac{S_m}{D_m} = 0$$

whence $x = \frac{D_m}{2}$ = mid-point of the curve.

If this straight-line relationship holds, then the maximum capacity
varies over a small range, since the end points of the line are fixed
by the maximum average speed and the minimum spacing which
have small variations.
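
The result of the differentiation can be confirmed numerically. The sketch below assumes the straight-line end points of 46 m.p.h. and 195 vehicles per mile that appear in the product quoted above, and searches densities in half-vehicle steps; the names used are assumptions. The exact product at the mid-point is 2,242.5, which the text gives in round figures as about 2,300.

```python
def volume(density, s_max, d_max):
    """V = x (S_m - x S_m / D_m) for the straight-line speed-density law."""
    return density * s_max * (1 - density / d_max)

S_MAX, D_MAX = 46, 195  # end points of the line, as read from the text

# Try every density from 0 to D_MAX in steps of half a vehicle per mile.
candidates = [half / 2 for half in range(0, 2 * D_MAX + 1)]
best_density = max(candidates, key=lambda d: volume(d, S_MAX, D_MAX))
best_volume = volume(best_density, S_MAX, D_MAX)
print(best_density, best_volume)
```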

V. 8. Volume and Speed. If volume is plotted against speed, the
resulting curve is given in Figure V.4. This curve shows that there
is a maximum volume and also that there are two speeds that give
the same volume. At the lower speed, there is considerable time
loss, Figure V.5.

Figure V.4
Speed in Miles per Hour Corresponding to a Given Volume
in Vehicles per Hour on a 2-Lane Highway


These curves bring out the fact that capacity needs to be
expressed in terms of both volume and speed. At maximum volume
there is always a considerable time or speed loss. The maximum
volume is evidently not a design volume.
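
That two speeds correspond to one volume follows directly from a straight-line speed-density law, as the sketch below illustrates. It assumes the line of V. 7. with end points 46 m.p.h. and 195 vehicles per mile; the curve actually plotted in Figure V.4 may rest on other data, and the function name and the sample volume of 1,500 are assumptions.

```python
import math

def speeds_for_volume(volume, s_max=46.0, d_max=195.0):
    """Return the (lower, higher) speed giving the stated volume.

    With S = S_m (1 - D / D_m) and V = D * S, the volume is a quadratic
    in S whose two roots lie symmetrically about S_m / 2.
    """
    max_volume = s_max * d_max / 4
    if volume > max_volume:
        raise ValueError("volume exceeds the maximum of the assumed line")
    root = math.sqrt(1 - volume / max_volume)
    return s_max / 2 * (1 - root), s_max / 2 * (1 + root)

low_speed, high_speed = speeds_for_volume(1500)
print(round(low_speed, 1), round(high_speed, 1))
```

At the lower of the two speeds the same number of vehicles pass per hour, but each spends far longer on the road, which is the time loss referred to above.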

The  Capacity  Manual  gives  a  great  deal  of  evidence  that  there 
are  definite  relationships  between  speeds  and  volumes.  This  is 
brought  out  by  numerous  curves  which  show  such  information  as 
the  number  of  drivers  desiring  to  pass  compared  to  the  number 
that  have  an  opportunity  to  pass,  the  total  percentage  of  the  time 
that  desired  speeds  can  be  maintained,  and  the  point  at  which 
drivers become influenced by the presence of vehicles ahead of
them.  Using  the  facts  set  forth  in  the  manual,  it  is  our  purpose  to 
see  if  there  is  a  rational  explanation  of  the  interrelationships  of  the 
different  phases  of  the  behavior  of  drivers  that  can  be  expressed 
mathematically. 
V. 9. The Nature of the Problems of Highway Traffic. We have
discussed some of the elements of the problems of highway capacity,
but have said very little about the nature and variability of these
elements. It is this variability that makes it difficult to solve the
problems involved. If all vehicles traveled at the same speed, or if
all people reacted in the same time interval, or if all drivers
maintained the same spacing at the same speed, the solutions would be
comparatively easy.

There is nothing new about the idea that the behavior pattern
of drivers is a stochastic variable. One of the writers found in 1933,
as already mentioned, that the minimum spacing depended
primarily on reaction-time which psychologists have long recognized as
a stochastic variable.³ Mr. John P. Kinzer assumed in 1934, that
the traffic distribution on a roadway followed a "random" or
Poisson distribution.⁸ In England, Mr. William F. Adams found
that free flowing traffic conformed so well to the distribution given
by a random series that it might be described as "normal." That
the time spacings between vehicles follow a random series in urban
traffic was reaffirmed by a study made in 1944-46.⁷

V. 10. Spacing as a Random Series. The assumption that spacing in
either time or distance units follows the "random" series furnishes
a means of studying the nature of spacing. To satisfy the conditions
of the Poisson series, a roadway would have vehicles scattered
along it at random so that any vehicle would be completely
independent of any other vehicle, and equal segments of the road
would be equally likely to contain the same number of vehicles.
Granting that these conditions exist, the total number of vehicles
on a roadway divided by the number of segments of road equals
"m" the average number of vehicles per segment. Then, according
to the Poisson series, the probability of zero vehicles appearing in
a segment is

$$e^{-m}\left(\frac{m^0}{0!}\right)$$

The probability of one vehicle appearing is

$$e^{-m}\left(\frac{m^1}{1!}\right)$$

The probability of two vehicles appearing is

$$e^{-m}\left(\frac{m^2}{2!}\right)$$

and the probability of n vehicles appearing is

$$e^{-m}\left(\frac{m^n}{n!}\right)$$

The sum of all the individual probabilities is

$$e^{-m}\left(\frac{m^0}{0!} + \frac{m^1}{1!} + \frac{m^2}{2!} + \cdots + \frac{m^n}{n!} + \cdots\right)$$

But

$$e^{m} = \frac{m^0}{0!} + \frac{m^1}{1!} + \frac{m^2}{2!} + \cdots + \frac{m^n}{n!} + \cdots$$

Therefore,

$$e^{-m} \cdot e^{m} = e^{0} = 1$$

This simply demonstrates what we know, namely that the sum of
all probabilities is unity, which means that an event is certain to
happen or not to happen. In this case, it means that any segment
is sure to contain zero or more vehicles since this covers all
alternatives.

Table V.2
Fitting of Poisson Curve by Chi-Square Test
Number of Vehicles Appearing in Five-Minute Intervals
Observations Taken on U.S. 20 Near Oaklawn, Illinois. Data Supplied by the U.S. Public Roads
Administration.
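
The terms of the Poisson series of V. 10. are easily generated, and their sum may be seen to approach unity. In the sketch below the function name is an assumption, and the average m is taken, for illustration only, as 115 vehicles per hour spread over five-minute intervals, the flow quoted for the Oaklawn observations.

```python
import math

def poisson_probability(n, m):
    """Probability of exactly n vehicles when the average is m."""
    return math.exp(-m) * m ** n / math.factorial(n)

m = 115 / 12  # average vehicles per five-minute interval at 115 per hour

terms = [poisson_probability(n, m) for n in range(61)]
total = sum(terms)
print(round(terms[0], 6), round(max(terms), 4), round(total, 9))
```

Sixty terms are far more than are needed here; the terms beyond about three times m contribute nothing of consequence to the sum.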

V. 11. Test of Goodness of Fit of the Poisson Series. The goodness of fit of the Poisson series to a set of data may be tested by the Chi-square (χ²) test. A cumulative Poisson table of probabilities is used to obtain the theoretical frequencies. The data in the illustrative example consist of the numbers of vehicles appearing in five-minute intervals on Route U.S. 20 near Oaklawn, Illinois. The volume of flow averaged about 115 vehicles per hour. These data were made available by the Public Roads Administration.

The first two columns in Table V.2 show the observed data. The figures in Column Three are taken from a Poisson table. Column Four is found by multiplying the figures in Column Three by the number of intervals observed (N = 328) to obtain the theoretical frequency. Column Five gives the differences between the observed or actual frequencies and the theoretical. Note that in this column the first two terms and the last four in Column Four have been combined, so that no class has an actual or theoretical frequency of less than five. Column Six gives the square of these differences. The figures in Column Six divided by the theoretical frequency give the values in Column Seven. The sum of these values, 7.747, equals "Chi-square" (χ²).

The degrees of freedom are equal to the number of classes less 2 (one for the total frequency and one for the mean, both of which are taken from the data), i.e., 9 - 2 = 7. From a Chi-square table of probability levels, it is found that the probability level is about .60 or 60 per cent.

A probability level of 5 per cent or less is usually taken as sufficient reason to reject the hypothesis that the data can be represented by the curve. The present level of about 60 per cent is well above this, and is therefore taken to be good evidence that the data may be represented by the Poisson curve.
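
The steps of the Chi-square test described above can be sketched as follows. This is only an illustration under stated assumptions: the observed counts are invented for the example (they are not the Table V.2 data), the last class is treated as "8 or more" and counted as 8 when the mean is estimated, and the pooling rule used here looks only at the theoretical frequency. All function names are hypothetical.

```python
import math

def poisson_probs(m, k_max):
    """Probabilities of 0 .. k_max-1 vehicles, with a final class 'k_max or more'."""
    probs = [math.exp(-m) * m**k / math.factorial(k) for k in range(k_max)]
    probs.append(1.0 - sum(probs))
    return probs

def pool_small_classes(observed, expected, minimum=5.0):
    """Combine classes at each end until every theoretical frequency is >= minimum."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[0] < minimum:
        obs[1] += obs[0]
        exp[1] += exp[0]
        obs.pop(0)
        exp.pop(0)
    while len(exp) > 1 and exp[-1] < minimum:
        obs[-2] += obs[-1]
        exp[-2] += exp[-1]
        obs.pop()
        exp.pop()
    return obs, exp

def chi_square(observed, expected):
    """Sum of (observed - theoretical)^2 / theoretical over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# HYPOTHETICAL counts of intervals containing 0, 1, ..., 7, and 8-or-more vehicles.
observed = [6, 18, 30, 28, 20, 11, 4, 2, 1]
N = sum(observed)
m = sum(k * f for k, f in enumerate(observed)) / N   # mean estimated from the data

expected = [N * p for p in poisson_probs(m, len(observed) - 1)]
obs_p, exp_p = pool_small_classes(observed, expected)
chi2 = chi_square(obs_p, exp_p)
dof = len(obs_p) - 2   # classes less 2, as in the text

print("classes after pooling:", len(obs_p))
print("chi-square = %.3f with %d degrees of freedom" % (chi2, dof))
```

The resulting value of Chi-square is then looked up in a Chi-square table at the computed degrees of freedom, exactly as in the worked example.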

V. 12. Test of Goodness of Fit of the Poisson Series to the Distribution of Spacings Between Vehicles. As already mentioned, we are also interested in the distribution of the time or distance spacings between successive vehicles. It is these time-gaps on the opposite lane that are used in passing. We shall now check the goodness of fit of the time spacing distribution to the Poisson curve. The data were taken on Route U.S. 240, Maryland, and were furnished by the Public Roads Administration. The Chi-square test will be used.

Table V.3

Fitting of Poisson Curve by Individual Terms Table
Time Spacing Between Vehicles (Chi-square Test)

Frequency Distribution of Time Spacings Between Vehicles on a Two-Lane Highway (Routes U.S. 50 and 240 in Maryland). Data Furnished by U.S. Public Roads Administration.

According to this method, as shown in Table V.3, it is immediately evident that there is a wide discrepancy between the actual and the theoretical frequencies. The probability level is practically zero.

If the distribution of time gaps between vehicles is not a Poisson series, what is it? To determine this, let us re-examine the nature of the Poisson series when applied to spacing distribution.

The  probability  of  the  occurrence  of  a  time  or  distance  gap  of  a 
given  length  is  the  probability  that  no  vehicle  will  appear  in  the 
given  interval. 

For example, given a volume of 400 vehicles per hour, let it be required to determine the probability "P0" of a one second interval having no vehicle. The average number of vehicles per second "m" is equal to 400/3600 = 1/9; therefore, the probability of a one second interval having no vehicle is equal to

$$P_0 = e^{-1/9}\left(\frac{(1/9)^0}{0!}\right) = e^{-1/9}, \quad \text{since } \frac{(1/9)^0}{0!} = 1$$

The probability of no vehicle appearing in 2 seconds is $e^{-2/9}$, and in 3 seconds $e^{-3/9}$. In general, the probability P0 of there being no vehicles in "s" seconds is equal to $e^{-ms}$. This equation is of the general form of

$$y = e^{x}$$

which may be written

$$\log_e y = x$$

therefore the equation, when plotted on semi-log paper, becomes a straight line. The negative sign of the exponent means that the slope of the line is negative.
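
The worked example above can be checked with a few lines of code. This is a minimal sketch; the volume of 400 vehicles per hour comes from the text, while the function name and the list of interval lengths are chosen here for illustration.

```python
import math

volume_per_hour = 400            # from the example in the text
m = volume_per_hour / 3600.0     # average number of vehicles per second = 1/9

def prob_no_vehicle(s, m=m):
    """Probability that no vehicle appears in an interval of s seconds."""
    return math.exp(-m * s)

# The natural logarithm of the probability falls by the same amount (m) for
# each added second, which is why the plot on semi-log paper is a straight line.
for s in (1, 2, 3, 9, 18):
    p = prob_no_vehicle(s)
    print("s = %2d sec   P0 = %.4f   log_e(P0) = %.4f" % (s, p, math.log(p)))
```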

For plotting on semi-log paper we first arrange the data as shown in the cumulative Table V.4, where the percentages of spacings equal to or less than a given interval are tabulated.

Table V.4
Fitting of Poisson Curve
Expected Error Method

Column headings, as far as they survive: Class interval in seconds; Class frequency (f); Cumulated ... (f0); Cumulated ...; ... error or uncertainty; Expected error in ...

Mean = 6.585

These percentages are represented by the heavy dots which fall in an irregular line as shown in Fig. V.6. This is to be expected, for unless a sample is very large there is always a "natural uncertainty" or difference between the sample values and those of the universe.

Figure V.6
Graph Showing Percentage of Vehicle Spacings and the Probable Amounts of the "Natural Uncertainty" of the Plotted Points

A fair measure of this uncertainty is the standard deviation of a class or sample. The formula for this natural uncertainty is


Z  = 


•-» 


where n equals the total number of happenings recorded, and f0 equals the accumulated frequency. Since n in the present case is 660, the factor depending on n alone is so nearly equal to 1 that it may be omitted, and the equation becomes:

$$Z = \sqrt{f_0\left(1 - \frac{f_0}{n}\right)}$$

An examination of this formula shows that the uncertainty depends upon the size of the sample and not upon the size of the universe. It may seem a little paradoxical that a 20 per cent sample may be no more representative of the universe than a 10 per cent sample. If, however, we recall that the size of the universe may be considered to be infinite, and this is practically true of traffic, then no sample is any nearer than any other to including all the universe. With this in mind it is entirely logical that the size of the universe does not appear in the formula for the measure of uncertainty.

If we could draw a line through the plotted points and stay within the natural uncertainty range we could conclude that the data could be represented by a straight line. But this is not the case, as can be seen in Figure V.6, so it must be that the distribution of spacings is not the special case of the Poisson series which may be represented by the curve $e^{-ms}$.

It appears, however, that the data can be closely represented by two straight lines. This implies that there may be two distributions, one for spacings less than about 4 seconds and another for spacings of more than that, and that each is "random" in the limited case.

If we take the class intervals equal to 5 seconds in order to smooth the curve, we obtain the points shown in Figure V.7, which lie approximately on a straight line. This indicates that, if we are not concerned with spacings of less than 5 seconds, the straight line represents the distribution of the spacings closely enough for approximate analysis.

Figure V.7
Distribution of Spacings Between Successive Vehicles: Class Intervals Equal to 5 Seconds

V. 13. Minimum Spacing. For what is believed to be the first indication that minimum spacing distributions might be different from those at greater distances, we refer to a study made in Ohio in 1934.⁵ The cumulative frequency curve shown in Figure V.8 is plotted from data collected at that time. The spacings, center to center of vehicles, are in feet.

Figure V.8
Cumulative Frequency Curve of Spacings between Successive Vehicles

It is indicated that the minimum spacing distribution is random and that it extends from about 30 feet to 200 feet. Evidently there are few, if any, spacings below 30 feet, and beyond 200 feet there is another random distribution different from that below 200 feet. This may be interpreted to mean that the distribution at less than 200 feet varies in accordance with the reaction-perception time of the driver and his judgment of what constitutes a safe distance. Beyond 200 feet, the spacing may be judged to be in accordance with the chance placement of the vehicles on the highway. If the observed results are compared with the theoretical curve, it is found that the deviations from the random distribution are accounted for by there being:

(a) No spacings below 30 feet.

(b) An excess of spacings between 30 and 200 feet.

(c) A deficit of spacings in excess of 200 feet.

These discrepancies are logical, for the minimum spacing, center to center of vehicles, is limited by the length of the vehicles, and vehicles closing up behind slower vehicles must wait for an opportunity to pass, which creates a preponderance of the smaller spacings.

Figure V.9
Cumulative Frequency Curve of Spacings between Successive Vehicles for Various Traffic Volumes on a Typical 2-Lane Rural Highway

If the spacing of about 200 feet is divided by the average speed of 34.1 miles per hour (about 50 feet per second) we obtain about 4 seconds as the limit of the zone of speeds reduced by the presence of other vehicles. These data, coming from only two locations, cannot be supposed to give a conclusive answer.

For  more  extensive  data,  let  us  turn  to  Figure  9,  page  40  of  the 
Capacity  Manual.  These  data  replotted  as  nearly  as  is  possible 
from  the  printed  curves  are  shown  in  Figure  V.9.  They  are  in  time 
spacings  and  the  breaks  in  the  curves  seem  to  come  between  five 
and  six  seconds. 

Theoretically, if the lines had no breaks there would be no interference, and if all vehicles were restricted there would be no breaks. These conditions were found and reported in the earlier paper referred to. To find the average of the "influenced" spacings we first make the reasonable assumption from the graphs that practically no spacings are under 1/2 second or over 6 seconds, and draw a line between these points as in Figure V.10. This line then represents a random distribution of "influenced" spacings.

The next step is to let S = m, where m is the average spacing. Now the expression

$$100\,e^{-S/m} = 100\,e^{-1} = 100\,(0.368) = 36.8\%$$

so that the average would be at the 36.8 per cent point and would equal about 1.7 seconds. At this average "random" spacing all vehicles would be travelling at a restricted speed due to the closeness of spacing between vehicles.
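
The reading of about 1.7 seconds can be reproduced approximately with the sketch below. It rests on an assumption that is not stated in the text: "practically no spacings over 6 seconds" is taken to mean that 1 per cent of the spacings remain at 6 seconds, so that the straight line on semi-log paper runs from 100 per cent at 1/2 second to 1 per cent at 6 seconds. A different reading of the graph would shift the result slightly. The variable and function names are hypothetical.

```python
import math

t_min = 0.5          # seconds; practically no spacings shorter than this (from the text)
t_max = 6.0          # seconds; practically no spacings longer than this (from the text)
pct_at_t_max = 1.0   # ASSUMPTION: per cent of spacings still longer than t_max

def pct_longer_than(t):
    """Per cent of 'influenced' spacings longer than t, on the straight semi-log line."""
    slope = math.log(pct_at_t_max / 100.0) / (t_max - t_min)   # negative slope
    return 100.0 * math.exp(slope * (t - t_min))

# The average spacing lies where the line crosses 100 * e**-1 = 36.8 per cent.
avg_spacing = t_min + (t_max - t_min) / math.log(100.0 / pct_at_t_max)

print("average influenced spacing = %.2f seconds" % avg_spacing)
print("per cent longer than the average = %.1f" % pct_longer_than(avg_spacing))
```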

Figure V.10
Random Distribution of "Influenced" Spacings
(Horizontal axis: Time Spacing Between Successive Vehicles in Seconds)

V. 14. The Minimum Spacing of Four-Lane Traffic. Traffic on a four-lane highway does not have the same spacing restrictions as traffic on a two-lane roadway. Vehicles are free to weave into the adjoining lane. When the curves shown in Figure 10, page 41 of the Capacity Manual are replotted as shown in Figure V.11, the resulting curves show no breaks. The distribution of time spacings is evidently random throughout.

V. 15. Frequency Distribution of Speeds. Having determined the characteristics of the spacing distributions, the next step is that of determining the nature of the distribution of automobile speeds.

Table V.5

Calculation of Standard Deviation of Distribution of Vehicle Speeds

      (1)              (2)               (3)            (4)      (5)
   Speed in        Observed no.      Deviation in
 miles per hour    of speeds = f0   class intervals     f0d     f0d²

  20.6 - 25.4            5               -4             -20      80
  25.6 - 30.4            7               -3             -21      63
  30.6 - 35.4           19               -2             -38      76
  35.6 - 40.4           23               -1             -23      23
  40.6 - 45.4           13                0               0       0
  45.6 - 50.4           15                1              15      15
  50.6 - 55.4           12                2              24      48
  55.6 - 60.4            5                3              15      45
  60.6 - 65.4            1                4               4      16

  Totals               100                              -44     366

Arithmetic Mean = X̄ = 40.8 miles per hour

$$S = 5\sqrt{\frac{\sum f_0 d^2}{N} - \left(\frac{\sum f_0 d}{N}\right)^2}$$

$$= 5.0\sqrt{\frac{366}{100} - \left(\frac{-44}{100}\right)^2}$$

$$= 5.0\sqrt{3.66 - .1936}$$

$$= 5\sqrt{3.4664}$$

$$= 5\,(1.862) = 9.31 = \text{standard deviation}$$


Table V.6

Column headings, as far as legible: (1) Speed in miles per hour; (2) Mid-point of class; (3) Number of speeds = f0 = observed frequency; (4) Deviation of class limit from mean; (5) Column 4 divided by standard deviation; (6) Per cent of area between class limit and mean; (7) Per cent of area in class interval; (8) Theoretical frequency ft = N(%); (9) (f0 - ft); (10) (f0 - ft)²; (11) (f0 - ft)²/ft. The tabulated values are not legible.


It has been found that this distribution closely follows the "normal curve." Again, as in the two previous examples of "random" distribution, the usual method of making a test of the goodness of fit is the Chi-square (χ²) test. For the sake of simplicity, let us take a small sample of 100 recorded speeds. The area method of fitting a normal curve to the observed distribution will be used. The area included within any number of standard deviations may be obtained from prepared tables of areas of the normal curve. The calculation of the standard deviation is shown in Table V.5.
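
The calculation in Table V.5 can be repeated in a few lines. This is a sketch only: the frequencies and deviations are those of Table V.5, while the assumed origin of 43.0 miles per hour (the mid-point of the 40.6 to 45.4 class, where the deviation is zero) is inferred here and is not stated in the table.

```python
import math

class_width = 5.0     # miles per hour
assumed_mean = 43.0   # ASSUMPTION: mid-point of the class whose deviation is 0

freq = [5, 7, 19, 23, 13, 15, 12, 5, 1]   # observed number of speeds, f0
dev = [-4, -3, -2, -1, 0, 1, 2, 3, 4]     # deviation in class intervals, d

N = sum(freq)
sum_fd = sum(f * d for f, d in zip(freq, dev))        # column 4 total
sum_fd2 = sum(f * d * d for f, d in zip(freq, dev))   # column 5 total

mean = assumed_mean + class_width * sum_fd / N
std_dev = class_width * math.sqrt(sum_fd2 / N - (sum_fd / N) ** 2)

print("N =", N, " sum f0d =", sum_fd, " sum f0d2 =", sum_fd2)
print("mean = %.1f mph, standard deviation = %.2f mph" % (mean, std_dev))
```

Working in whole class intervals keeps the arithmetic small; the class width is restored only in the last step.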

The steps in the calculation are arranged as shown in Table V.6, with the data in the respective columns consisting of the following:

(1) The speeds in class intervals of 5 miles per hour.

(2) The mid-points of the classes.

(3) The number of speeds recorded, i.e., the frequency f0.

(4) The deviations of the class limits from the arithmetic mean.

(5) The deviations from the mean in terms of standard deviations. This column is obtained by dividing the numbers in column 4 by the standard deviation.

(6) Per cent of the area between the class limit and the mean. This is obtained from an area table of the normal distribution.

(7) Per cent of area in class interval. This is obtained by subtracting successively the numbers in column 6.

(8) The theoretical frequency ft is obtained by multiplying the per cent of area in each class interval by the total number of speeds observed. This equals 100 in the present case.

(9) This column gives the difference between the observed frequency f0 (column 3) and the theoretical frequency ft (column 8).

(10) This column is obtained by squaring the items in column 9.

(11) This column is obtained by dividing the items in column 10 by the theoretical frequency ft. The sum of the items in this column equals χ². This is the value we use with the Chi-square table.

In using the Chi-square table we need to know the degrees of freedom. In fitting a normal distribution three degrees of freedom are lost (or three constraints are imposed) because (1) the total frequency, (2) the arithmetic mean, and (3) the value of the standard deviation are used in computing the normal frequencies. The possible number of degrees of freedom is equal to the number of class intervals, 7 in this case. Therefore, 7 - 3 = 4, the degrees of freedom in the given example.

We find from the Chi-square table that the probability level is more than 5 per cent, which means that in more than 5 times out of 100 the sample could have come from the universe tested. A level above 5 per cent is taken to mean that there is not sufficient evidence to reject the hypothesis that the data can be represented by a normal curve. In the present case the probability is more than .30, which means that a variation as great as the amount found might occur in 30 cases out of 100 due to chance. Therefore it is not to be considered as significant.
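
The area method and the Chi-square sum can be sketched as below. Several points are assumptions made for this illustration and may differ from the authors' Table V.6: the normal areas are computed with the error function instead of being read from a printed table, the two end classes are given the full tails of the curve, and the nine classes are reduced to seven by combining the first two and the last two. The figure 9.488 is the tabulated 5 per cent value of Chi-square for 4 degrees of freedom.

```python
import math

def normal_cdf(x, mean, sd):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

mean, sd, N = 40.8, 9.31, 100                       # from Table V.5
limits = [20.5, 25.5, 30.5, 35.5, 40.5, 45.5, 50.5, 55.5, 60.5, 65.5]
observed = [5, 7, 19, 23, 13, 15, 12, 5, 1]

# Area below each class limit; the end classes take the tails (ASSUMPTION).
cdf = [normal_cdf(x, mean, sd) for x in limits]
cdf[0], cdf[-1] = 0.0, 1.0
theoretical = [N * (hi - lo) for lo, hi in zip(cdf[:-1], cdf[1:])]

# ASSUMPTION: combine the first two and the last two classes, leaving 7 classes.
obs7 = [observed[0] + observed[1]] + observed[2:7] + [observed[7] + observed[8]]
the7 = [theoretical[0] + theoretical[1]] + theoretical[2:7] + [theoretical[7] + theoretical[8]]

chi2 = sum((o - t) ** 2 / t for o, t in zip(obs7, the7))
dof = len(obs7) - 3   # total, mean and standard deviation are taken from the data

print("theoretical frequencies:", [round(t, 1) for t in the7])
print("chi-square = %.2f with %d degrees of freedom" % (chi2, dof))
print("exceeds the 5 per cent value of 9.488:", chi2 > 9.488)
```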

V.  16.  A  Graphical  Method  of  Determining  Goodness  of  Fit.  Another 
means  of  determining  whether  the  distribution  is  normal  or  not  is 
to  plot  the  percentage  of  speeds  at  or  less  than  various  speeds  on 
arithmetic  probability  paper.  If  the  distribution  is  "normal"  the 
observed  data  will  be  represented  by  a  straight  line.  In  such  a  case, 
due  to  symmetry  the  speed  given  by  the  intersection  of  the  straight 
line  with  the  50  per  cent  ordinate  is  the  most  frequent  and  average 
speed,  as  well  as  the  median.  The  usual  definitions  become : 

Mean  Average  Speed  =  arithmetical  mean  of  all  speeds  -  also 
called  probable  or  expected  speed. 

Median  Speed  =  speed  such  that  50  per  cent  of  the  speeds  are 
greater,  and  50  per  cent  less. 

Modal  Speed  =  the  most  frequently  occurring  speed. 

The data utilized are the numbers of cars with speeds equal to or less than a given series of equally spaced values. The same data will be used as in the first illustration. They are shown in Table V.7.

The points listed in Table V.7 are plotted in Figure V.12. It will be seen that they fall in rather irregular fashion, and that at first glance, the position of the 63.5 mile per hour point appears to preclude the possibility of drawing a satisfactory straight line.

First, however, it is important to consider the probable amounts of the "natural uncertainty". Recall that the natural uncertainty

$$Z = \sqrt{f_0\left(1 - \frac{f_0}{N}\right)}$$

This natural uncertainty is given for each frequency in the last column of the table.

Figure V.12
Graph Showing Percentage of Vehicles Traveling Above and Below Various Speeds and the Probable Amounts of the "Natural Uncertainty" of the Plotted Points

Table V.7

 Speed in miles    Cumulated        Per cent equal    Natural uncertainty
   per hour        frequency (f)    to or slower        in per cent

     20.5               0                 0                 0.0
     25.5               5                 5                 2.18
     30.5              12                12                 3.24
     35.5              31                31                 4.62
     40.5              54                54                 4.97
     45.5              67                67                 4.70
     50.5              82                82                 3.84
     55.5              94                94                 2.37
     60.5              99                99                 0.99
     63.5             100               100                 0.0
     65.5             100               100                 0.0

If  the  percentage  of  cars  travelling  slower  than  a  given  speed 
or  equal  to  it  is  plotted  against  speed,  the  points  will  fall  in  an 
irregular  line.  This  is  to  be  expected,  particularly  when  the  number 
of  cars  represented  in  one  diagram  is  only  100.  If  counts  are  made 
a  number  of  times  under  precisely  the  same  conditions  of  traffic, 
the  percentage  traveling  faster  than,  say  40  miles  per  hour,  will 
never  be  exactly  the  same,  except  by  chance.  There  will  be  a 
certain  dispersion  around  the  average  value  for  several  groups  of 
100  cars.  This  we  have  already  referred  to  in  article  V.12.  as  a 
"natural  uncertainty". 

Through each plotted point, a horizontal line is drawn representing the allowed ± range in the value of f0. It is then permissible to draw a smoothed curve in such a way that it passes through all the horizontal lines, attempting to draw it so that the sum of the deviations from the actually counted values shall be equal.
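
The ± ranges drawn through the plotted points can be tabulated with the short sketch below. It assumes the natural uncertainty is the square root of f0 (1 - f0/N), which agrees with the last column of Table V.7 to within rounding; the function name is chosen here for illustration. Because N is 100, the cumulated frequency and the percentage are the same number.

```python
import math

def natural_uncertainty(f0, n):
    """Z = sqrt(f0 * (1 - f0 / n)); f0 is the accumulated frequency out of n."""
    return math.sqrt(f0 * (1.0 - f0 / n))

n = 100
# Speed in miles per hour -> cumulated frequency, from Table V.7.
cumulated = {25.5: 5, 30.5: 12, 35.5: 31, 40.5: 54, 45.5: 67,
             50.5: 82, 55.5: 94, 60.5: 99, 63.5: 100}

for speed, f0 in cumulated.items():
    z = natural_uncertainty(f0, n)
    print("%.1f mph: %3d per cent, range %.2f to %.2f" % (speed, f0, f0 - z, f0 + z))
```

The range shrinks to nothing at 0 and 100 per cent, which is why the points at the extremes need the separate treatment given to the 63.5 mile per hour point.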

In the present case, a straight line satisfies all but the 63.5 mile per hour point. In the preceding formula, f0 should really be the mean number of cars with velocity equal to or less than the given amount, found from a great number of sets of 100 cars under the same traffic conditions. In such cases, it is fair to suppose that an occasional car traveling faster than 63.5 miles per hour would be found. Then the actual percentage slower than 63.5 would be slightly less than 100. If, for example, it were 99.5, the natural uncertainty would then be ± 0.7, and the point and the dotted line would give the result. In this case, it is evident that the straight line can be passed through all the horizontal lines. This means principally that the points given by the higher speeds are too erratic and sensitive to accidental fluctuations to be given much weight in drawing the curve. Probably all points for percentages less than 2 and greater than 98 should be ignored in drawing the curve.

That the "normal" dispersion pattern describes the speed range is demonstrated if we replot some of the speed curves shown in Figure 5 of the Capacity Manual. These curves plotted on arithmetic probability paper are very nearly straight lines as shown in Figure V.13, where the distributions for traffic volumes of 600, 1200, and 1800 vehicles per hour are given.

Figure V.13
Typical Speed Distributions at Various Traffic Volumes on Level, Tangent Sections of 2-Lane, High-Speed Existing Highways

Judging from these examples it may be assumed that a straight line will satisfy the data and that the "smoothed" values read from the curve may be used in analysis.
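
Plotting on arithmetic probability paper amounts to converting each cumulative percentage to its normal deviate and checking whether the points lie on a straight line. The sketch below does this for the Table V.7 data. It is an illustration rather than the authors' procedure: a least-squares line stands in for the line drawn by eye, and, following the suggestion in the text, percentages below 2 and above 98 are left out.

```python
from statistics import NormalDist

# (speed in miles per hour, per cent equal to or slower), from Table V.7
points = [(25.5, 5), (30.5, 12), (35.5, 31), (40.5, 54),
          (45.5, 67), (50.5, 82), (55.5, 94), (60.5, 99)]

# Ignore the erratic points at the extremes of the percentage scale.
used = [(v, p) for v, p in points if 2 <= p <= 98]

# The probability scale is the normal deviate of the cumulative percentage.
z = [NormalDist().inv_cdf(p / 100.0) for _, p in used]
v = [speed for speed, _ in used]

# Least-squares straight line v = a + b * z (a stand-in for the line drawn by eye).
n = len(used)
z_bar = sum(z) / n
v_bar = sum(v) / n
b = sum((zi - z_bar) * (vi - v_bar) for zi, vi in zip(z, v)) / \
    sum((zi - z_bar) ** 2 for zi in z)
a = v_bar - b * z_bar

print("speed at the 50 per cent ordinate (mean and median) = %.1f mph" % a)
print("slope of the line (standard deviation) = %.1f mph" % b)
```

The intercept and slope read from the line can be compared with the arithmetic mean of 40.8 and the standard deviation of 9.31 miles per hour computed in Table V.5.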


V. 17. Estimating Speeds and Volumes. Having determined the free speed distribution on a highway, it is possible to estimate the speed at greater traffic volumes.

The first step is to find the average difference in speed between the vehicles being passed and the passing vehicles. The rate at which the faster vehicles are overtaking the slower ones can be found from a speed distribution curve.(a) Such a curve is shown in Figure V.14, as replotted from Figure 4, page 30, of the Capacity Manual. It is evident that there are just as many vehicles traveling above the average (or 50 percentile speed) as below it. The average speed differential is the difference between the average speed of the 50 per cent faster vehicles and that of the 50 per cent slower vehicles. The average of the 50 per cent faster vehicles comes at the 78.75 percentile, and the average of the 50 per cent slower vehicles comes at the 21.25 percentile.(b)

The average speed of the faster vehicles equals 47.5 miles per hour and the average for the slower ones is 37.5 miles per hour, so that the average difference is 10 miles per hour.

Figure V.14
Frequency Distribution of Travel Speeds of Free Moving Vehicles on Level, Tangent Sections of the Majority of Existing 2-Lane Main Rural Highways

(a) In a study of passing made in 1935⁶, it was found that vehicles in the act of passing other slower vehicles were traveling 9 to 10 miles per hour faster. The Capacity Manual gives 9.6 miles per hour as the average passing speed differential. (Footnote continued on p. 183).

(b) This can be proved as follows: Let Figure V.15 represent the same curve as Figure V.14, but plotted on linear cross section paper.


Figure V.15
Determination of the Mean Abscissa of the Upper Half of the Normal Distribution Curve and the Area to the Right of this Abscissa

Required: To find (1) the mean abscissa of the upper half of the normal distribution curve, and (2) the area to the right of this abscissa.

$$\bar{x} = \frac{\int_0^\infty x\,y\,dx}{\int_0^\infty y\,dx} = \frac{1}{1/2}\int_0^\infty x\,y\,dx = \frac{2}{\sigma\sqrt{2\pi}}\int_0^\infty x\,e^{-x^2/2\sigma^2}\,dx = \sqrt{\frac{2}{\pi}}\,\sigma, \text{ which is about } .798\,\sigma.$$

From a table of areas under the normal curve, the area to the right of .798σ is .2125, or 21.25 per cent of the total area. In other words, 21.25% of the speeds will exceed the average of all the speeds higher than the average speed. Similarly, because of symmetry, 21.25% of the speeds will be less than the average of all the speeds lower than the average speed.
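
The two figures in this footnote, .798σ and .2125, can be verified numerically. The sketch below is only a check under simple assumptions: σ is set to 1, the integrals are approximated by the mid-point rule out to 8σ (beyond which the area is negligible), and the area to the right is taken from Python's `statistics.NormalDist` rather than from a printed table.

```python
import math
from statistics import NormalDist

sigma = 1.0  # work in units of the standard deviation

def y(x):
    """Ordinate of the normal curve with mean 0 and standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

# Mid-point rule over the upper half of the curve, from 0 to 8 sigma.
steps = 80000
dx = 8.0 * sigma / steps
xs = [(i + 0.5) * dx for i in range(steps)]
numerator = sum(x * y(x) for x in xs) * dx     # integral of x*y dx
denominator = sum(y(x) for x in xs) * dx       # integral of y dx, about 1/2
x_bar = numerator / denominator

area_right = 1.0 - NormalDist(0.0, sigma).cdf(x_bar)

print("mean abscissa of the upper half = %.4f sigma" % x_bar)
print("closed form sqrt(2/pi)          = %.4f sigma" % math.sqrt(2.0 / math.pi))
print("area to the right               = %.4f" % area_right)
```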
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Having  found  the  average  speed  differential  we  next  find  the 
percentage  of  spaces  either  large  enough  or  too  small  to  permit 
passing. 

Assume, for example, that a two-lane road is carrying 800 vehicles per hour and that the distribution of time spaces is random, with the average spacing m = 3600/400 = 9 seconds (since there are 400 vehicles passing a point every hour in one direction, or every 3600 seconds), and that the minimum spacing is 1/2 second. The curve for the distribution is shown in Figure V.16.

Figure V.16
Cumulative Distribution of Time Spaces Assumed for 2-Lane Road Carrying 800 Vehicles per Hour

With 10 seconds as the average time required for passing, we find from the curve in Figure V.16 that 67 per cent of the spaces are too small for passing. This means that 67 per cent of the time a driver on this highway could not pass because of vehicles on the opposite lane.

This concept becomes clear if we keep in mind that at any instant the chance of there being a free space of less than 10 seconds on the opposite lane is equal to the percentage of the total spaces that are less than 10 seconds. In this sense the size of the time-gap has nothing to do with the chance of its being opposite the driver at any particular instant. It is only the frequency of the occurrence of the space that determines the probability of its happening in so far as passing is concerned. This reasoning becomes clearer if we remember that a space, even if large, is usually used for only one passing. For example, 6 time spaces might occupy 50 seconds with one equal to 10 seconds to permit one passing, or one of the spaces might be 25 seconds and still permit only one passing during the 50 seconds. (See Article V.23 for mathematical solution.)

If a driver is not to be retarded, he must, every time he approaches a vehicle ahead, immediately pass the leading vehicle. If his speed is on the average 10 miles an hour faster, then the per cent of the time he cannot pass is the per cent of the 10 miles per hour difference that he must lose. In the present instance he would lose 67 per cent of 10 miles per hour, or 6.7 miles per hour. Subtracting this from the 43 miles per hour average speed gives 36.3 miles per hour as the estimated average speed if the volume is 800 vehicles per hour for two lanes. This very nearly equals the observed speed of 36 miles per hour as shown in the lower curve, Figure 5, page 31, of the Capacity Manual. This result would indicate that this method of estimating is accurate enough to give good design figures. As a further check let us estimate the speed for 1200 vehicles per hour for two lanes. From the curve shown in Figure V.17 we find that vehicles are prevented from passing for 83 per cent of the time.

Figure V.17
Cumulative Distribution of Time Spaces Assumed for 2-Lane Road Carrying 1200 Vehicles per Hour

The speed drop is thus 83 per cent of 10 miles per hour = 8.3 miles per hour. Subtracting this from 43 gives 34.7 miles per hour. This is more than the observed result of about 32 miles per hour shown in Figure 5, page 31, of the Manual.
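
The two estimates above can be reproduced approximately with the sketch below. It depends on an assumption about the curves in Figures V.16 and V.17, which are read graphically in the text: the proportion of time spaces longer than t seconds is taken as exp(-(t - 1/2) / (m - 1/2)), that is, a straight line on semi-log paper starting at the minimum spacing of 1/2 second and crossing 36.8 per cent at the average spacing m. The equal split of volume between the two directions follows the 800 vehicle example; the function names are hypothetical.

```python
import math

def fraction_too_small(volume_two_lanes, gap_needed=10.0, min_spacing=0.5):
    """Fraction of opposing time spaces shorter than gap_needed seconds.

    ASSUMPTION: spaces longer than t make up exp(-(t - min) / (mean - min)).
    """
    opposing = volume_two_lanes / 2.0    # vehicles per hour in the opposite lane
    mean_spacing = 3600.0 / opposing     # seconds between opposing vehicles
    return 1.0 - math.exp(-(gap_needed - min_spacing) / (mean_spacing - min_spacing))

def estimated_speed(volume_two_lanes, free_speed=43.0, differential=10.0):
    """Free speed less the share of the speed differential lost while unable to pass."""
    return free_speed - differential * fraction_too_small(volume_two_lanes)

for volume in (800, 1200):
    print("%4d vehicles per hour: %.0f per cent of spaces too small, speed %.1f mph"
          % (volume, 100 * fraction_too_small(volume), estimated_speed(volume)))
```

Under this assumption the sketch gives roughly the 67 and 83 per cent read from the curves; small differences are to be expected because the text works from plotted curves.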

This lack of agreement needs to be examined to see if there is an explanation. According to the theory just advanced, the speed drop due to inability to pass cannot exceed the average speed differential. How can we account for a speed drop greater than this? The logical conclusion is that a further speed drop is not due to an inability to pass but to some other cause. If we recall that there is a speed drop directly proportional to spacing, the reason for the further speed loss becomes clear. With a volume of 1200 vehicles per hour, a high percentage of vehicles are traveling in the six second zone of mutual interference and are slowed because they are too close together rather than because of an inability to pass.

V. 18. Estimate of Size Gap Required for Weaving. It is impossible to
estimate the speed drop for a given increase in volume on a four-lane
road without knowing the time-gap required for weaving. But
since the speed drop has been measured, it is possible, by reversing
the method just explained, to estimate the time-gap for weaving.
From Figure 46, page 122, of the Capacity Manual, we find that
at 1700 vehicles per hour the distribution between lanes is equal.
The speed on both lanes at this point should be the same. Referring
to Figure 7, page 33, of the Capacity Manual, the speed at a flow of
1700 vehicles per hour is about 41 miles per hour. This is a drop of 7
miles per hour. Since the average speed differential is 8.8 miles per
hour, in order for a speed decrease of 7 miles per hour to take place,
on the average each car driver would be retarded

$$\frac{7}{8.8} = 79.5 \text{ per cent}$$

of the time. This means that 79.5 per cent of the spaces on the
adjoining lane are too small to permit weaving. From Figure 10,
page 41, of the Capacity Manual, we find that the intersection of
the 1700 vehicles per hour abscissa and the 79.5 per cent ordinate
gives 3 seconds as about the time-gap required for weaving. This
time-gap compares very closely indeed with the average weaving
gap of 3 seconds as found by Wynn and Gourlay¹⁰.

V. 19. Physical Features of Highway. Effect on Traffic Flow. Having
discussed the interrelationships of the characteristics of flow,
uninterrupted except by other traffic, the next step is to find what
happens if the flow is slowed or interrupted by physical features of
the highway. Let us first direct our attention to a location where
passing is prohibited. This occurs in mountainous or hilly country
where grades or restricted sight distances prevent passing.

For this problem assume that the average speed differential is
9 miles per hour and that it is required to estimate the time loss
due to a stretch of highway where passing cannot take place for
one half of the time. Let us further assume that the volume
is 600 vehicles per hour. Reasoning as before, that a driver in order
not to lose speed must be able to pass as soon as he approaches
behind a slower vehicle, we conclude that for one half of the time
he must sacrifice the speed differential between his own speed and
that of the slower vehicle. Thus if the average speed differential is
9 miles per hour the speed loss in this case would be 1/2 × 9 = 4.5
miles per hour. To this loss must be added the loss due to an
inability to pass because of vehicles on the opposite lane. Proceeding
as before, for a volume of 600 vehicles per hour we find 17 per
cent of the spaces are greater than the 10 seconds required for
passing. This means that for 83 per cent of the time that there
is sufficient sight distance to pass, the passing maneuver is
prevented by traffic on the opposite lane. The additional speed loss
is 0.83 × 4.5, or about 3.75 miles per hour. Therefore, the total
speed loss is equal to 4.5 + 3.75 = 8.25 miles per hour.

Figure V.18

Distribution of Vehicles Between Traffic Lanes on a
4-Lane Highway During Various Hourly Traffic Volumes

Figure V.19

Frequency Distribution of Time Spacing between Successive
Vehicles Traveling in the Same Direction, at Various Traffic
Volumes on a Typical 4-Lane Rural Highway

(Figure 10, page 41, and Figure 46, page 122, "Highway Capacity Manual." Used by permission
of Bureau of Public Roads, U.S. Department of Commerce.)
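The two components of this speed loss can be sketched as follows. The function and argument names are hypothetical; the inputs are the figures of the example (a 9 mile per hour differential, passing barred one half of the time, and 83 per cent of opposing gaps too small). Carried to three decimals the second component is 3.735 rather than the rounded 3.75.

```python
def no_passing_speed_loss(differential_mph, fraction_no_sight, fraction_gaps_too_small):
    """Split the speed loss into the sight-distance part and the opposing-traffic part."""
    # Share of the differential lost where passing is barred outright.
    sight_loss = fraction_no_sight * differential_mph
    # Where sight distance allows passing, opposing traffic still blocks part of the time.
    traffic_loss = fraction_gaps_too_small * (1.0 - fraction_no_sight) * differential_mph
    return sight_loss, traffic_loss


sight_loss, traffic_loss = no_passing_speed_loss(9.0, 0.5, 0.83)
total_loss = sight_loss + traffic_loss
print(round(sight_loss, 2), round(traffic_loss, 2), round(total_loss, 2))
```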

V. 20. Crossing Streams of Traffic. The capacity of a highway or
street is limited by delays at intersections. The basic condition, but
not the simplest to analyze, may be thought of as the intersecting of
two two-lane roads without any traffic control⁷. Each vehicle under
such a condition crosses during a gap in the opposing stream of
vehicles. The average minimum acceptable time gap has been
measured and found to range from 4.6 to 6 seconds depending
upon the type of intersection, with the average being 4.8 seconds¹¹.
Mr. Raff calls this "minimum acceptable time-gap" a critical lag
and correctly defines it as the size lag which has the property that
the number of accepted lags shorter than L, the critical lag, is the
same as the number of rejected lags longer than L. In other words,
the acceptable time gap is just as likely to be accepted as it is to
be rejected. The probability that it will be accepted is thus equal
to 1/2.

The chances of any single vehicle being delayed at an intersection
can be deduced in the same manner as the delay in passing
by saying that the chance of crossing depends upon the probability
of there being a time-gap of sufficient size at the instant the
vehicle approaches the crossing. This probability depends upon the
relative frequency of gaps and not upon their size. Thus if 75 per
cent of the gaps are as large or larger than required for crossing,
then the chance of being able to cross without delay is 75 per cent,
and the chance of being delayed is 25 per cent. With this reasoning,
and recalling the exponential law of distribution of time-gaps, the
probability of being delayed would be

$$1 - e^{-S/m} = (1 - e^{-S/m}) \times 100 \text{ in per cent}$$

The probability of not being delayed would equal

$$e^{-S/m} = (e^{-S/m}) \times 100 \text{ in per cent}$$

where m is the average size of time-gap on the street being crossed
and S is the size of gap required for crossing.

This reasoning applies to single or "first-in-line" vehicles, for a
next-in-line vehicle has to wait for the first vehicle to clear and
hence is delayed a longer time, or looking at it in a different way,
has a greater chance of being delayed. This question of added delay
will be considered later in Art. V.25. For an illustration let the
traffic on the main highway be 400 vehicles per hour. The fact
that it is moving in two directions is immaterial. For our purpose
it may be considered to all be in one direction. The average spacing
between vehicles on the main highway will be

$$\frac{3600}{400} = 9 \text{ seconds.}$$

Since there are practically no spacings below 1/2 second the
distribution of spacings will be approximately that shown in Figure
V.16. Recall that the average is at point .368 on the per cent
ordinate. This curve shows that 52 per cent of the spaces are greater
than 6 seconds and 48 per cent smaller.
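As a rough check on the reading from the curve, the simple exponential law can be evaluated directly. This is only a sketch: it assumes the plain form of the law, with no allowance for the absence of spacings under 1/2 second, so it gives about 51 per cent where the curve of Figure V.16 gives 52. The function name is illustrative.

```python
import math


def fraction_of_gaps_at_least(required_gap_s, mean_gap_s):
    """Share of time-gaps as large as the required gap under the simple exponential law."""
    return math.exp(-required_gap_s / mean_gap_s)


mean_gap = 3600 / 400                      # 9 seconds at 400 vehicles per hour
p_no_delay = fraction_of_gaps_at_least(6.0, mean_gap)
p_delay = 1.0 - p_no_delay
print(round(100 * p_no_delay, 1), round(100 * p_delay, 1))
```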

V. 21. Mathematical Determination of Vehicle Delay Time. The
problem of determining the proportion of time that a vehicle is
delayed may be approached by a more rigorous mathematical
analysis. This problem along with other related problems has been
solved by Mr. W. F. Adams in examples worked out in connection
with his paper, "Road Traffic Considered as a Random Series."¹²

The proportion of time occupied by intervals greater than t
seconds, according to Mr. Adams, is

$$e^{-Nt}(Nt + 1) \tag{V.21.1}$$

wherein "N" equals vehicles per second. The proof is as follows:

Consider the intervals of lengths lying between t and t + dt, and
for the moment assume we are dealing with a period of one hour.
In one hour the expected number of intervals greater than t is

$$T e^{-Nt}$$

where T = vehicles per hour. This is basically the same as the formula
$100\,e^{-S/m}$, but with different notation.

Similarly, the expected number of intervals greater than t + dt
is

$$T e^{-N(t + dt)} = T e^{-(Nt + N\,dt)} = T e^{-Nt}\,e^{-N\,dt}$$

by the rule for addition of indices.

The number of intervals of lengths between t and t + dt is

$$T e^{-Nt} - T e^{-Nt} e^{-N\,dt} = T e^{-Nt}\left(1 - e^{-N\,dt}\right)$$

Expanding $e^{-N\,dt}$ in terms of N dt,

$$= T e^{-Nt}\left(1 - 1 + N\,dt - N^2 dt^2/2! + N^3 dt^3/3! - \dots\right)$$

$$= T e^{-Nt} N\,dt, \text{ omitting terms in } dt^2 \text{ and higher powers,}$$

$$= T N e^{-Nt}\,dt$$

To the first order of small quantities, the length of all such
intervals may be taken as t.

The time occupied by these intervals is therefore

$$T N t\,e^{-Nt}\,dt \text{ seconds}$$

The time occupied by all intervals greater than t during one
hour is found by integrating this expression between limits t and
infinity,

$$= TN \int_t^{\infty} t\,e^{-Nt}\,dt$$

Integrating by parts, $\int u\,dv = uv - \int v\,du$.
Put u = t, du = dt, and $dv = e^{-Nt}\,dt$ so that

$$v = \int e^{-Nt}\,dt = -e^{-Nt}/N$$

The above expression then becomes

$$TN\left[-t\,e^{-Nt}/N + \int e^{-Nt}\,dt/N\right]_t^{\infty} = TN\left[-t\,e^{-Nt}/N - e^{-Nt}/N^2\right]_t^{\infty}$$

Both terms are zero when t is infinite, so that the number of
seconds occupied by intervals over t seconds during one hour
becomes

$$TN\left(t\,e^{-Nt}/N + e^{-Nt}/N^2\right) = 3600\,N^2\left(t\,e^{-Nt}/N + e^{-Nt}/N^2\right) = 3600\,e^{-Nt}(Nt + 1)$$

Now the total time considered is 3600 seconds, so that the
proportion of time occupied by intervals over t seconds is

$$e^{-Nt}(Nt + 1)$$

Conversely, the proportion of time occupied by intervals less
than t is

$$1 - e^{-Nt}(Nt + 1) \tag{V.21.2}$$
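The closed form can be checked against a direct numerical integration of the time occupied by long intervals. The sketch below is illustrative only: the function names, the midpoint rule, and the choice of 400 vehicles per hour with t = 6 seconds are assumptions made for the check, not part of Mr. Adams' paper.

```python
import math


def proportion_of_time_over(n_per_s, t_s):
    """Closed form e^(-Nt)(Nt + 1): share of time occupied by intervals longer than t."""
    x = n_per_s * t_s
    return math.exp(-x) * (x + 1.0)


def proportion_by_integration(n_per_s, t_s, steps=200_000, upper_multiple=60.0):
    """Midpoint-rule value of the integral of N^2 * x * e^(-N x) from t upward."""
    upper = t_s + upper_multiple / n_per_s   # the tail beyond this point is negligible
    width = (upper - t_s) / steps
    total = 0.0
    for i in range(steps):
        x = t_s + (i + 0.5) * width
        total += n_per_s ** 2 * x * math.exp(-n_per_s * x)
    return total * width


N = 400 / 3600                 # vehicles per second
closed_form = proportion_of_time_over(N, 6.0)
numeric = proportion_by_integration(N, 6.0)
print(round(closed_form, 4), round(numeric, 4))
```

The integrand N²x e^(-Nx) is the expression TNt e^(-Nt) of the proof divided by the 3600 seconds of the hour, with T = 3600N.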

V. 22. Graphical Method of Determining Proportion of Time Occupied
by Time-Gaps of Given Size. The time occupied by time-gaps
larger (or smaller) than any given value may be determined
graphically. This is possible because we know that the average size gap
in any range is always at .368 or the 36.8 percentile point of the
range.

For the purpose of demonstration let it be required to find the
proportion of time occupied by time-gaps larger than 6 seconds in
a stream of traffic of 600 vehicles per hour. The average space is
equal to

$$\frac{3600}{600} = 6 \text{ seconds.}$$

This average is at the 36.8 percentile point, so we may construct
the curve $100\,e^{-S/m}$, which we have already discussed, by
selecting several values for S to get values for S/m (m = 6) to give
points on the curve. The curve is shown in Figure V.20.

Figure V.20

Cumulative Distribution of Time Spaces Assumed for
2-Lane Road Carrying 600 Vehicles per Hour

The average spacing is 6 seconds at the 36.8 percentile point. The
average for the spacings greater than 6 seconds is at the point 36.8
per cent of 36.8 per cent or 13.5 per cent. The corresponding spacing
is 12 seconds. Thus, the average of all spacings is 6 seconds and
the average for the spacings above 6 seconds is 12 seconds.
Therefore, the proportion of time occupied by spacings greater than
6 seconds is equal to

$$\frac{36.8 \text{ (per cent)} \times 12}{100 \text{ (per cent)} \times 6} = .736 = 73.6 \text{ per cent}$$

Using the formula $e^{-Nt}(Nt + 1)$ with N = 1/6, t = 6:

$$e^{-Nt}(Nt + 1) = e^{-1}(1 + 1) = .368 \times 2 = .736 = 73.6\%$$
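The graphical reasoning amounts to multiplying the share of gaps longer than t by the ratio of their average length to the average of all gaps. A minimal sketch follows, assuming the exponential law and taking the average of the gaps longer than t as t plus the overall average (which gives the 12 seconds of the example); the function name is hypothetical.

```python
import math


def proportion_of_time_graphical(t_s, mean_gap_s):
    """Share of gaps longer than t times their mean length, relative to the overall mean."""
    share_of_gaps = math.exp(-t_s / mean_gap_s)   # 36.8 per cent when t equals the mean
    mean_of_longer_gaps = t_s + mean_gap_s        # 12 seconds in the example
    return share_of_gaps * mean_of_longer_gaps / mean_gap_s


proportion = proportion_of_time_graphical(6.0, 3600 / 600)
print(round(proportion, 3))
```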

V. 23. The Average Length of All Intervals. The average length of
all intervals greater than t seconds is equal to the total time greater
than t seconds divided by the number of intervals greater than t
seconds, i.e.,

$$\frac{e^{-Nt}(Nt + 1)}{N e^{-Nt}} = \left(\frac{1}{N} + t\right) \text{ seconds} \tag{V.23.1}$$

Conversely, the average length of interval less than t seconds is
equal to the total time occupied by intervals less than t seconds
divided by the number of intervals of less than t seconds, i.e.,

$$\begin{aligned}
\frac{1 - e^{-Nt}(Nt + 1)}{N(1 - e^{-Nt})}
&= \frac{1 - Nt\,e^{-Nt} - e^{-Nt}}{N(1 - e^{-Nt})} \\
&= \frac{1 - e^{-Nt}}{N(1 - e^{-Nt})} - \frac{Nt\,e^{-Nt}}{N(1 - e^{-Nt})} \\
&= \frac{1}{N} - \frac{t\,e^{-Nt}}{1 - e^{-Nt}}
\end{aligned} \tag{V.23.2}$$

Having determined the average length of intervals of less than
t seconds it still remains to be found how much delay these
intervals cause. The following solution is given by Mr. Adams:

Solution:

When any pedestrian or driver arrives, he may find

(a) that no vehicle arrives during the next t seconds. The
probability of this is $e^{-Nt}$ and in this case his waiting time is zero.

(b) that a vehicle arrives during the first t seconds, but none
arrives in the t seconds following the arrival of the first
vehicle. The probability of this is $(1 - e^{-Nt})\,e^{-Nt}$ and the
waiting time is one interval.

(c) that the first two intervals after his arrival are each less than
t seconds, but the third is greater than t. The probability is
$(1 - e^{-Nt})^2\,e^{-Nt}$ and he has to wait for two intervals each
less than t seconds.

In similar manner it may be shown that the probability of any
driver or pedestrian having to wait for n intervals each less than t
seconds is

$$(1 - e^{-Nt})^n\,e^{-Nt}$$

The Expectation(a) of intervals for which the driver or pedestrian
has to wait is given by the series

$$0 \cdot e^{-Nt} + 1\,(1 - e^{-Nt})\,e^{-Nt} + 2\,(1 - e^{-Nt})^2\,e^{-Nt} + \dots$$

$$= e^{-Nt}\left\{1\,(1 - e^{-Nt}) + 2\,(1 - e^{-Nt})^2 + 3\,(1 - e^{-Nt})^3 + \dots\right\}$$

Summing the series in brackets to infinity(b) the expected number
of intervals becomes

$$\frac{e^{-Nt}(1 - e^{-Nt})}{(e^{-Nt})^2} = \frac{1 - e^{-Nt}}{e^{-Nt}} \tag{V.23.3}$$

The average length of the intervals of less than t seconds as
already found is

$$\frac{1}{N} - \frac{t\,e^{-Nt}}{1 - e^{-Nt}} \text{ seconds.}$$

The average waiting time will be the product of the expected
number of intervals and the average length of interval

$$\frac{1 - e^{-Nt}}{N e^{-Nt}} - \frac{t\,e^{-Nt}(1 - e^{-Nt})}{e^{-Nt}(1 - e^{-Nt})} = \frac{1}{N e^{-Nt}} - \frac{1}{N} - t \tag{V.23.4}$$

This is the average delay to all drivers or pedestrians, whether each
one is delayed or not. However, a proportion $e^{-Nt}$ of them find that
the first vehicle does not arrive during the t seconds following their
own arrival, so that this proportion of them is not delayed at all.
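The steps of this solution can be checked numerically: sum the series for the expected number of short intervals, multiply by the average length of a short interval, and compare with the closed form 1/(N e^(-Nt)) - 1/N - t. The sketch below is illustrative; the function names are hypothetical, and the traffic of 600 vehicles per hour with t = 6 seconds is assumed only for the purpose of the check.

```python
import math


def expected_short_intervals(n_per_s, t_s, terms=2000):
    """Partial sum of the series k * (1 - e^(-Nt))^k * e^(-Nt)."""
    q = math.exp(-n_per_s * t_s)
    a = 1.0 - q
    return sum(k * a ** k * q for k in range(terms))


def mean_short_interval(n_per_s, t_s):
    """Average length of the intervals shorter than t (expression V.23.2)."""
    q = math.exp(-n_per_s * t_s)
    return 1.0 / n_per_s - t_s * q / (1.0 - q)


def average_wait_all(n_per_s, t_s):
    """Average delay to all arrivals, delayed or not (expression V.23.4)."""
    return 1.0 / (n_per_s * math.exp(-n_per_s * t_s)) - 1.0 / n_per_s - t_s


N, t = 600 / 3600, 6.0                     # assumed illustration
series_sum = expected_short_intervals(N, t)
closed_sum = (1.0 - math.exp(-N * t)) / math.exp(-N * t)
wait_from_product = series_sum * mean_short_interval(N, t)
wait_closed = average_wait_all(N, t)
print(round(series_sum, 4), round(wait_from_product, 3), round(wait_closed, 3))
```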

The proportion delayed is therefore

$$1 - e^{-Nt}$$

and the average waiting time of those who suffer delay is

$$\frac{1/(N e^{-Nt}) - 1/N - t}{1 - e^{-Nt}} = \frac{1 - e^{-Nt}}{N e^{-Nt}(1 - e^{-Nt})} - \frac{t}{1 - e^{-Nt}} = \frac{1}{N e^{-Nt}} - \frac{t}{1 - e^{-Nt}} \tag{V.23.6}$$

(a) The 'Expectation' of an event which may at each trial take any one
of a number of possible values is found by multiplying each of the possible
values by the probability of its occurrence and summing the resultant
products. It represents the average value to be expected from a large
number of trials. (Cf. Footnote b.)

(b) Put $(1 - e^{-Nt}) = a$ and note that a, being a probability, must be less
than 1.

The series then becomes

$$a + 2a^2 + 3a^3 + 4a^4 + \dots + n a^n + \dots$$

The sum to infinity of this series (see Hall and Knight's "Higher Algebra,"
Chap. V., section 60, example 1) is

$$a/(1 - a)^2 = (1 - e^{-Nt})/(e^{-Nt})^2$$

Mr. Warren S. Quimby¹³, using the formula in a modified form,
gives the delay as

$$\text{Delay} = \frac{3600}{v\,e^{-vt/3600}} - \frac{t}{1 - e^{-vt/3600}} \tag{V.23.7}$$

wherein t = acceptable time gap in seconds
        v = number of vehicles per lane per hour
        e = base of Napierian logarithms = 2.71828
        3600 = number of seconds in one hour.

These delays are for a single vehicle approaching the intersections.
Mr. Quimby gives a comparison of the theoretical delay with the
observed delay in the following table:

Table V.8

Comparison of Theoretical and Field Delays
to First Vehicle in Line

Sample                         A      B      C      D      E      F
Theoretical delay, seconds    6.60   7.10   6.91   6.95   7.04   4.05
Actual delay, seconds         6.4    6.2    6.8    8.0    8.7    4.4
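Mr. Quimby's expression is Mr. Adams' delay for those who are delayed, written with v vehicles per hour in place of N vehicles per second. The sketch below, with hypothetical function names, checks that the two forms agree. The acceptable time gap behind the tabulated values is not stated in this article, so the 600 vehicles per hour and 6 second gap used here are assumptions for illustration and no attempt is made to reproduce the table.

```python
import math


def quimby_delay_seconds(v_per_hour, t_s):
    """Delay to a delayed first-in-line vehicle, with volume in vehicles per hour."""
    x = v_per_hour * t_s / 3600.0
    return 3600.0 / (v_per_hour * math.exp(-x)) - t_s / (1.0 - math.exp(-x))


def adams_delay_of_delayed(n_per_s, t_s):
    """Average delay to all arrivals divided by the proportion delayed."""
    q = math.exp(-n_per_s * t_s)
    return (1.0 / (n_per_s * q) - 1.0 / n_per_s - t_s) / (1.0 - q)


delay_quimby = quimby_delay_seconds(600.0, 6.0)         # assumed illustration
delay_adams = adams_delay_of_delayed(600.0 / 3600.0, 6.0)
print(round(delay_quimby, 2), round(delay_adams, 2))
```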

For determining the percentage of vehicles delayed, Mr. Quimby
gives the following formula:

$$\text{Per cent delayed} = 1 - e^{-vt/3600} + \left(1 - e^{-vt/3600}\right) T$$

wherein the terms are as already defined with the exception of T,
which is the probability of a vehicle arriving in any given time
interval.

Mr. Quimby states that this formula includes a consideration
of both main and side street volumes and thus is affected by a
change in the volume on either street.

The following table compares the actual with the theoretical
delay:

Table V.9

Comparison of Theoretical and Field Observations
of Total Traffic Delayed

Sample                         A      B      C      D      E      F
Main street volume            568    635    606    608    627    200
Side street volume            110    115    116    123    191    181
Per cent delayed, theory      55.3   60.7   58.7   59.3   65.9   16.0
Per cent delayed, actual      53.8   55.0   56.5   59.2   63.0   14.6

Another researcher to use a rational approach to this same
problem is Mr. Morton S. Raff¹¹.

All cars are not "first-in-line," for often several vehicles are
blocked so that there is a second, a third and so on, position car.
He states that the percentage of vehicles delayed as given by the
formula

$$P = 100\,(1 - e^{-NL})$$

is too small. This formula will again be recognized as the same one
as just discussed but with a different notation. That is, NL = Nt.
In this formula N = number of vehicles on main street and L =
the "lag." In order to take account of this sluggishness, Mr. Raff
modifies the formula and arrives at the following:

$$P = 100\left[1 - \frac{e^{-2.5 N_s}\,e^{-2NL}}{1 - e^{-2.5 N_s}\,(1 - e^{-NL})}\right]$$

where

P = Percentage of side-street cars delayed
N = Main street volume, in cars per second
N_s = Side-street volume, in cars per second
L = Critical lag in seconds
e = Base of natural logarithm

Mr. Raff states an examination shows that:

1. The limit of P, as N_s approaches zero, is $100\,(1 - e^{-NL})$,
which is the theoretical formula. In other words, if there are
no side-street cars, there is no sluggishness effect.

2. P always exceeds $100\,(1 - e^{-NL})$, except when N_s equals
zero. In other words, the sluggishness effect delays more cars
than would be delayed if it did not exist.

3. P is always less than 100 per cent, for any finite volume.

4. The partial derivatives of P with respect to N, N_s, and L are
all positive. This means that an increase in either of the two
volumes or the critical lag causes an increase in the
percentage of cars delayed, as given by this formula.¹¹

The coefficient of N_s has been found from observed delays to give
values close to actual experimental results. For the theoretical
development of the formula see Mr. Raff's book.
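The first three of these properties can be tried numerically. The sketch below assumes the formula reads P = 100[1 - e^(-2.5Ns) e^(-2NL) / (1 - e^(-2.5Ns)(1 - e^(-NL)))], the reading that agrees with the limit stated in item 1. The function names and the trial volumes (600 main-street and 120 side-street cars per hour, with a 5 second lag) are hypothetical and are not taken from Mr. Raff's data.

```python
import math


def raff_percent_delayed(n_main, n_side, lag_s):
    """Percentage of side-street cars delayed, allowing for the sluggishness effect."""
    a = math.exp(-2.5 * n_side)      # side-street term
    b = math.exp(-n_main * lag_s)    # share of main-street lags longer than L
    return 100.0 * (1.0 - a * b ** 2 / (1.0 - a * (1.0 - b)))


def simple_percent_delayed(n_main, lag_s):
    """Theoretical percentage for first-in-line cars only."""
    return 100.0 * (1.0 - math.exp(-n_main * lag_s))


N_main, N_side, L = 600 / 3600, 120 / 3600, 5.0   # assumed trial values
p_raff = raff_percent_delayed(N_main, N_side, L)
p_simple = simple_percent_delayed(N_main, L)
p_limit = raff_percent_delayed(N_main, 0.0, L)
print(round(p_simple, 1), round(p_raff, 1))
```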

V. 24. The Signalized Intersection. The signalized intersection
presents a problem that is different from that where there is no
control or only a stop sign. The periods for crossing are at fixed
intervals rather than at random as are the openings in an opposing
stream of traffic. Since traffic is naturally distributed
haphazardly, it follows that any fixed time signal causes unnecessary
delay. The minimum delay follows the shortest timing interval that
will permit all the waiting vehicles to clear. This fact is easily
comprehended if we think of a very long timing such as a 30 minute
red followed by a 30 minute green signal. During the 30 minute
green interval on one street there would be no delay but on the
other street all traffic appearing at the intersection during the long
interval would be blocked. The average wait would thus be about
15 minutes. Obviously, as the timing is decreased, the average
waiting time decreases until such time as the traffic fails to clear
during each signal change.

The  two  fundamental  problems  in  signal  control  therefore  are  (1) 
finding  the  shortest  timing  that  will  not  cause  excessive  failures  to 
clear  the  waiting  traffic  and  (2)  determining  the  delay  caused  by 
the  fixed  timing. 

Perhaps the method of determining the chances of signal failures
to clear traffic may most easily be explained by means of an
illustrative solution.(a)

(a) This treatment of the signalized intersection is abstracted from:
"Application of Statistical Sampling Methods to Traffic Performance
at Urban Intersections" by Bruce D. Greenshields, (Proceedings of the
Twenty-Sixth Annual Meeting), The Highway Research Board, December,
1946, pp. 377-389.

Let it be required to find the probability of the cycle failure for
395 vehicles per hour on each lane with a 20 second green and a
20 second red signal cycle. Since observations have shown that
usually slightly more than 20 seconds are required after the light
changes to green for seven vehicles to enter the intersection, it
will be assumed that the cycle will fail whenever seven or more
vehicles appear in 40 seconds.

The average number of vehicles appearing in 40 sec. is

$$\frac{40 \times 395}{3600} = 4.4 = m.$$

With this value of m, the probability of seven or more
vehicles appearing in 40 sec. (found from table) equals 15.63 per
cent. Therefore, the traffic signal will fail to clear the waiting
traffic 15.63 per cent of the time.

If it is desired to reduce the per cent of failures to say 5 per
cent, it is only necessary to try a longer cycle. Two or three trials
will usually give a result sufficiently close. The method is one of
cut and try.

For a second trial, let us try a 25 second green, 25 second red
cycle. The average number of vehicles appearing during the cycle
of 50 seconds is

$$\frac{50 \times 395}{3600} = 5.5 = m.$$

Since 10 vehicles will cause a failure, the percentage of the time
that 10 or more will appear is read from the Poisson Table as .0537
or 5.37 per cent.

This is nearly the desired answer and serves to illustrate the
procedure. If a more accurate result is wanted, another trial could be
made.
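Where a Poisson table is not at hand, the two trials can be computed directly. The sketch below uses the rounded means 4.4 and 5.5 of the text; the function name is an assumption. Carried to more places, the second trial comes out very slightly above the tabular .0537.

```python
import math


def poisson_tail(k, m):
    """Probability of k or more arrivals when the average number is m."""
    below = sum(math.exp(-m) * m ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


first_trial = poisson_tail(7, 4.4)     # 20-20 signal, failure at seven or more
second_trial = poisson_tail(10, 5.5)   # 25-25 signal, failure at ten or more
print(round(first_trial, 4), round(second_trial, 4))
```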

Any  signal  failure  will  affect  the  chances  of  a  succeeding  failure 
since  there  will  be  vehicles  left  over  from  the  first  cycle.  In  the 
present  example  with  a  20-20  signal,  the  second  signal  will  fail  if : 

1.  Seven  vehicles  arrive  during  the  first  and  six  or  more  during 
the  second  cycle. 

2.  Eight  vehicles  arrive  during  the  first  and  five  or  more  during 
the  second  cycle. 

3.  Nine  vehicles  arrive  during  the  first  and  four  or  more  during 
the  second  cycle. 

4.  Ten  vehicles  arrive  during  the  first  and  three  or  more  during 
the  second  cycle. 

5.  Eleven  vehicles  arrive  during  the  first  and  two  or  more  during 
the  second  cycle. 

6.  Twelve  vehicles    arrive   during  the  first  and  one  or  more 
during  the  second  cycle. 

If the probabilities of the arrivals of the vehicles, as found in the
Poisson tables, are multiplied together and added to give the total
probability, the result is as follows:

1. .0778 × .2800 = .02178
2. .0428 × .4488 = .01921
3. .0209 × .6405 = .01338
4. .0092 × .8149 = .00750
5. .0037 × .9337 = .00345
6. .0013 × .9877 = .00128

             Total = .06660

This means that two signals will fail in succession 6.66 per cent
of the time. In order to have three successive failures, there would
need to be:

Thirteen vehicles in the first two cycles and six or more in the
third,
Fourteen vehicles in the first two cycles and five or more in
the third,
Fifteen vehicles in the first two cycles and four or more in the
third, etc.

with the added condition that there be seven or more in the first
cycle. While it is possible as just shown to compute the probabilities
for these, it is cumbrous. Therefore a much less tedious method
that gives results that agree closely with the more exact procedure
will now be described.
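The six products for two failures in succession can be accumulated in a short loop. This is a sketch of the computation as tabulated (seven through twelve vehicles in the first cycle, and enough in the second to bring the total to thirteen); the function names are hypothetical, and m = 4.4 is the rounded mean used in the text.

```python
import math


def poisson_pmf(k, m):
    """Probability of exactly k arrivals when the average number is m."""
    return math.exp(-m) * m ** k / math.factorial(k)


def poisson_tail(k, m):
    """Probability of k or more arrivals when the average number is m."""
    return 1.0 - sum(poisson_pmf(i, m) for i in range(k))


def two_successive_failures(m, fail_at=7):
    """Sum of P(exactly k in first cycle) * P(enough in second cycle to reach the total)."""
    needed_in_two = 2 * fail_at - 1                 # 13 in the example
    total = 0.0
    for first in range(fail_at, needed_in_two):     # 7 through 12
        total += poisson_pmf(first, m) * poisson_tail(needed_in_two - first, m)
    return total


p_two = two_successive_failures(4.4)
print(round(p_two, 4))
```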

In the example just given the two cycles would fail in succession
if 13 or more vehicles appeared during the two cycles, provided
that seven or more appeared in the first cycle.

The average number appearing in two cycles (80 secs.) equals

$$\frac{80 \times 395}{3600} = 8.8 = m$$

The probability of 13 or more appearing in the two cycles is
.1102 as found in the Poisson tables (4 places is considered
sufficient).

The average flow for the two failing cycles is not eight, the
average flow on the roadway, but "13 or more vehicles". If it were
known just how many vehicles "13 or more" amounts to it would
be possible with this value of m to determine the probability of
seven or more vehicles appearing in the first cycle. The next step
is to find the mean value of "13 or more". Finding the
arithmetical average requires extensive multiplication, but the mean
value can be found very quickly. From the Poisson table it is
found that the probability of:

13 or more vehicles appearing equals .1102
14 or more vehicles appearing equals .0642
15 or more vehicles appearing equals .0353
16 or more vehicles appearing equals .0184
17 or more vehicles appearing equals .0091
18 or more vehicles appearing equals .0043
19 or more vehicles appearing equals .0019
20 or more vehicles appearing equals .0008

The mean of .1102 (the probability of 13 or more vehicles
appearing) is .0551. According to the Poisson table above the
number of vehicles corresponding to .0551 falls between 14 and 15.
The values from the table above are plotted on semi-log paper.
Note that the points fall on a nearly straight line. This fact makes
it possible to interpolate between 14 and 15. The number of
vehicles shown on the abscissa corresponding to 0.0551 is equal to
approximately 14.3 which is the mean of "13 or more" for the two
cycles or approximately 7.15 for one cycle. With this new m the
probability of seven or more vehicles appearing in the first cycle
is equal to 0.5939.

The probability of the two cycles failing is equal to the
probability of there being 13 or more in the two cycles multiplied by the
probability of there being seven or more in the first cycle, or 0.1102
× .5939 = 0.0654. This may be compared with the correct value
of .0666.
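The semi-log interpolation for the mean value of "13 or more" can be imitated by interpolating linearly in the logarithm of the tail probability. The sketch below is an assumption about how the graphical step may be mechanized, with a hypothetical function name; it covers only the interpolation and stops at the figure of about 14.3 vehicles.

```python
import math


def poisson_tail(k, m):
    """Probability of k or more arrivals when the average number is m."""
    below = sum(math.exp(-m) * m ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def half_probability_count(threshold, m):
    """Count at which the tail falls to half of P(threshold or more), by log interpolation."""
    target = poisson_tail(threshold, m) / 2.0
    k = threshold
    while poisson_tail(k + 1, m) > target:
        k += 1
    upper, lower = poisson_tail(k, m), poisson_tail(k + 1, m)
    # A straight line on semi-log paper means linear interpolation in log-probability.
    fraction = (math.log(upper) - math.log(target)) / (math.log(upper) - math.log(lower))
    return k + fraction


mean_of_13_or_more = half_probability_count(13, 8.8)
print(round(mean_of_13_or_more, 1))
```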

The  probability  of  three  cycles  failing  in  succession  would  be 
equal  to  the  probability  of  19  or  more  vehicles  appearing  in  three 
cycles  times  the  probability  of  13  or  more  in  two  cycles  (with  m 
equal  to  ^-),  times  the  probability  of  seven  or  more  in  the  first  cycle. 

V. 25. Calculating Delay at Signalized Intersections. It is possible to
calculate the delay at a signalized intersection by first finding the
probability of retarding 1, 2, 3, ..., n vehicles, and then computing
the average delay for the first, second, third, etc. vehicles in line.
The theoretical method of doing this is explained in "Traffic
Performance at Urban Street Intersections,"⁷ pages 91-94, but the
procedure is too tedious to be practical. A method that is
practical is described in this same reference, pages 95-97 and 100.

V. 26. Practical Method for Determining Number of Vehicles Retarded
at the Signalized Intersection. Before determining the delay per
light cycle, it is necessary to ascertain the number of vehicles
retarded. The proportion of vehicles retarded is greater than the
proportion of the red signal to the entire cycle, since each
retarded vehicle in effect increases the blocking period. The exact
extent to which this occurs has been measured.

For the first vehicle to arrive at the intersection the potential
blocking period is equal to the red interval R of the signal, though
it may not experience the full potential if it arrives after the
beginning of the red interval. The second vehicle, if it is not stopped,
may not follow closer on the average than 1.7 seconds behind the
first vehicle which enters 3.8 seconds after the light changes to
green. The blocking period for the second vehicle therefore is
R + 3.8 + 1.7 = R + 5.5 seconds.

The second vehicle enters 3.1 seconds after the first, so that the
potential blocking period for the third vehicle becomes
R + 3.8 + 3.1 + 1.7 = R + 8.6 seconds.




Similarly  the  potential  blocking  period  for  the  fourth  vehicle 
equals       R  +  3.8  +  3.1  +  2.7  +  1.7  =  R  +  11.3  seconds 

In  general,  the  potential  blocking  period  is  obtained  by  adding 
to  the  signal  interval  the  additional  delay  interval  caused  by  the 
preceding  vehicles  plus  1.7  seconds. 
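
The rule just stated can be sketched in a few lines of Python. This is only an illustration: the function name and the list layout are assumptions made here, and the only constants used are the entry times (3.8, 3.1 and 2.7 seconds) and the 1.7 second following gap quoted above.

```python
# Field constants quoted in the text, in seconds.
ENTRY_TIMES = [3.8, 3.1, 2.7]  # entry time of the 1st, 2nd and 3rd stopped vehicle
FOLLOWING_GAP = 1.7            # closest average spacing behind an entering vehicle


def potential_blocking_period(red, position):
    """Potential blocking period for the vehicle in the given position.

    position = 1 is the first vehicle to arrive, which faces only the red
    interval itself; later vehicles also wait for those ahead to enter.
    """
    if position < 1 or position > len(ENTRY_TIMES) + 1:
        raise ValueError("the text quotes constants for positions 1 to 4 only")
    if position == 1:
        return red
    return red + sum(ENTRY_TIMES[:position - 1]) + FOLLOWING_GAP


# Blocking periods for the first four vehicles with a 20 second red interval.
for k in range(1, 5):
    print(k, round(potential_blocking_period(20, k), 1))
```

Calling the function with red = 0 returns only the additional blocking period, that is, the 5.5, 8.6 and 11.3 seconds added to R in the expressions above.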

The additional blocking periods created when various numbers of
vehicles are retarded are shown in Figure V.22, taken from page 96
of Traffic Performance at Urban Street Intersections.7

As an illustrative example, let it be required to find the average
number of vehicles retarded for a traffic volume of 228 vehicles per
hour on a single lane with the signal set for 30 seconds go and 20
seconds stop. The average number of vehicles arriving during the
20 second red period is 1.27 vehicles [(20 X 228)/3600]. (This
might be approximately one for each of three cycles and two for
the fourth cycle.) As explained, these 1.27 vehicles tend to
increase the effective length of the red signal. Reference to Figure
V.22 shows that 1.27 vehicles increase the blocking period by
about 6.4 seconds. The blocking period may now be considered to
be 26.4 seconds (20 + 6.4). A 26.4 second blocking period,
however, will retard about 1.67 vehicles [(26.4 X 228)/3600].

The increase of the blocking period due to 1.67 vehicles is 7.7
seconds and the blocking period is now estimated to be 27.7 seconds.
During the 27.7 seconds of blocking period 1.75 vehicles will be
retarded, increasing the estimate of the blocking period to 27.95
seconds. By further successive approximation, the number of
vehicles retarded can be obtained with any degree of accuracy
desired. This information may be shown in tabular form:

Table V.10. Average Number of Vehicles Stopped with 228
Vehicles per Hour per Lane and 20 Second Red Period

Approximation     Length of Blocking Period     Average No. of Vehicles Retarded

1st               20 seconds                    1.27
2nd               26.4                          1.67
3rd               27.7                          1.75
4th               27.95                         1.77
5th               28                            1.77

Figure V.22. Additional Blocking Periods Created when
Various Numbers of Vehicles are Retarded

For this particular example it seems sufficiently accurate to use an
average of 1.77 vehicles per red signal. This shows that with a
volume of 228 vehicles per hour per lane a 20 second red interval
becomes, in effect, a 28 second blocking period.
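
The successive approximation of this example can be sketched as a short loop. Figure V.22 itself is not reproduced here, so the sketch assumes, purely for illustration, that the curve can be replaced by a straight line through the two values given in the text for exactly one and two retarded vehicles (5.5 and 8.6 seconds). The function names are likewise hypothetical.

```python
def additional_blocking(n_retarded):
    """Additional blocking period in seconds for n_retarded vehicles.

    ASSUMPTION: a straight line through (1 vehicle, 5.5 s) and
    (2 vehicles, 8.6 s) stands in for the curve of Figure V.22.
    """
    return 5.5 + (n_retarded - 1.0) * (8.6 - 5.5)


def vehicles_retarded(volume_per_hour, red, tol=1e-6, max_iter=100):
    """Repeat the approximation until the blocking period stops changing."""
    rate = volume_per_hour / 3600.0      # vehicles arriving per second
    blocking = red                       # 1st approximation: red interval only
    for _ in range(max_iter):
        n = rate * blocking              # vehicles caught by the current estimate
        updated = red + additional_blocking(n)
        converged = abs(updated - blocking) < tol
        blocking = updated
        if converged:
            break
    return rate * blocking, blocking


n, blocking = vehicles_retarded(228, 20)
print(round(n, 2), round(blocking, 1))
```

Because the straight line only approximates the figure, the intermediate values differ slightly from those of Table V.10, but the loop settles close to the same 1.77 vehicles and 28 second blocking period.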

V.  27.  The  Average  Arrival  Method  of  Determining  Delay.  A  practical 
method  of  calculating  the  time  loss  for  a  given  number  of  vehicles 
stopped  is  based  upon  an  assumption  as  to  the  arrival  time  of  the 
first  vehicle.  The  method  may  be  illustrated  as  follows: 

Let the red interval be 30 seconds. It is assumed that the first
vehicle will arrive on the average at the mid-point, wait 15 seconds,
and lose 3.8 seconds in entering the intersection. To this is
added another two seconds lost in accelerating to a speed of
15 miles an hour, giving a total loss of 20.8 seconds. (The
acceleration loss would be greater for higher speeds.) The total loss
(using symbols) is

R/2 + 3.8 + a

wherein R equals the red interval and a the acceleration loss for
a given normal traveling speed. The second vehicle arrives on the
average at the mid-point of the stop period of R + 5.5, and leaves
at R + 6.9. The time loss is equal to

R + 6.9 - (R + 5.5)/2 + 1 = 20.15 seconds

wherein 1 is the acceleration loss a.
The loss for the third vehicle is:

R + 9.6 - (R + 3.8 + 3.1 + 1.7)/2 + a

= 39.6 - (30 + 3.8 + 3.1 + 1.7)/2 + 1 = 21.3 seconds

The loss for the fourth vehicle is:

R + 12 - (R + 9.6 + 1.7)/2 = 21.35 seconds.

No  acceleration  loss  is  added  for  the  fourth  vehicle  since  it  has 
reached  normal  speed  by  the  time  it  enters  the  intersection. 
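
The four calculations above follow one pattern: the time of leaving, less the mid-point of the stop period, plus the acceleration loss. A minimal sketch of that pattern is given below. The list names are assumptions made for illustration; the numbers are the ones used in the text for a normal speed of 15 miles an hour.

```python
# Seconds after the end of the red interval at which vehicles 1 to 4 leave.
DEPARTURE_AFTER_RED = [3.8, 6.9, 9.6, 12.0]
# Seconds added to R to give the stop period of vehicles 1 to 4.
EXTRA_STOP_PERIOD = [0.0, 5.5, 8.6, 11.3]
# Acceleration loss in seconds used in the text for vehicles 1 to 4.
ACCELERATION_LOSS = [2.0, 1.0, 1.0, 0.0]


def average_delay(red, position):
    """Average time loss of the vehicle in the given position (1 to 4)."""
    i = position - 1
    mean_arrival = (red + EXTRA_STOP_PERIOD[i]) / 2.0   # mid-point of stop period
    departure = red + DEPARTURE_AFTER_RED[i]
    return departure - mean_arrival + ACCELERATION_LOSS[i]


for k in range(1, 5):
    print(k, round(average_delay(30, k), 2))
```

With R = 30 this gives the 20.8, 20.15, 21.3 and 21.35 seconds found above; other red intervals can be tried by changing the first argument.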
By following this method the delay for any number of vehicles
retarded may be calculated, but it is only the method that is of
interest to us here. According to the reference just mentioned the
observed delay agrees very closely with that calculated. The delay
occurring in traffic with various proportions of trucks, street cars,
and other types of vehicles needs to be observed to obtain more
accurate and representative field constants.

V.  28.  Rare  Events  (Accidents).  There  are  many  events  in  traffic 
that  are  comparatively  rare.  This  is  particularly  true  of  certain 
types  of  accidents.  Taken  as  a  whole,  traffic  accidents  exact  a 
high  toll  in  lives  and  property  but  the  average  driver  is  rarely 
involved  in  a  serious  mishap.  Problems  involving  rare  events  may 
be  analyzed  by  the  Poisson  distribution  which  is  also  known  as  the 
law  of  small  chances. 

One study that made use of the law was conducted by Dr. H. M.
Johnson.14 He examined the accident histories of 29,531 Connecticut
drivers selected at random, each of whom had been licensed for
the period 1931-1936.

Table V.11
Actual and Expected Distribution of Accidents, Including
Casualties and Property Damage Exceeding $25, Reported
to the Commissioner of Motor Vehicles of Connecticut,
1931-36, in a Licensed Driver Sample Selected at Random.

Accidents per        Operators having these accidents
operator during      Actual       Expected
experience           number       number        Difference

      0              23,881       23,234            647
      1               4,503        5,572         -1,069
      2                 936          668            268
      3                 160           53            107
      4                  33   )
      5                  14   )
      6                   3   )
      7                   1   )

   Totals            29,531       29,531              0

Note: The probability that the differences between the actual and expected distributions
are due to chance = 1/6 (10)-1", which is insignificant.

Among these 29,531 drivers there accrued 7,082 accidents which
involved 5,650 operators. Mr. Johnson found that the accidents
were not distributed among the drivers according to the law of
chances for which the sole parameter is the rate per operator. He,
therefore, concluded that some operators were accident prone for
some reason that could only be determined experimentally.

The table shows the actual accidents, the expected number as
calculated from the Poisson distribution and the difference
between the theoretical and the actual number.

It  may  be  noted  that  there  are  more  accident-free  drivers  than 
accounted  for  by  the  laws  of  chance  and  also  more  repeaters  with 
a  corresponding  deficiency  of  drivers  having  a  moderate  accident 
rate. 

Mr. Johnson found, among other things, that drivers who were
under 16-20 years old at the beginning of the experience and under
22-27 years old at its close had 1.47 times as many of the
non-personal accidents as they would have if the distribution of
accidents were independent of age. That this difference is not
accidental, according to Mr. Johnson, is evidenced by the fact that
the probability of the independency-hypothesis being true is less
than 10^-24.

The significance of Mr. Johnson's report is that it demonstrates
the use of the Poisson distribution in studying rare events.
Suppose that one wishes to know whether a driver having 3 accidents
in 6 years is an accident-prone driver. According to Mr. Johnson's
figures the average for all drivers is

7082/29531 = .2398 = .24 accidents = m.

With this value of m we find from a Poisson distribution table
that the probability of a driver having 3 accidents is .0018 or .18
per cent. This means that the chances are 100 to .18 or
approximately 550 to 1 against an average driver's having 3 accidents. We
may conclude, therefore, that a driver who has this many mishaps
is a bad risk.
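
Where a Poisson table is not at hand, the same figures can be computed directly from the Poisson formula, P(k) = e^(-m) m^k / k!. The following is a minimal sketch using only the Python standard library. The function and variable names are chosen here for illustration, and the counts are those reported for the Connecticut sample.

```python
import math


def poisson_pmf(k, m):
    """Probability of exactly k events when the average number is m."""
    return math.exp(-m) * m ** k / math.factorial(k)


DRIVERS = 29531
ACCIDENTS = 7082
m = ACCIDENTS / DRIVERS                      # about .24 accidents per driver

# Expected number of operators with 0, 1, 2 and 3 accidents.
expected = [DRIVERS * poisson_pmf(k, m) for k in range(4)]
print([round(x) for x in expected])

# Probability of exactly 3 accidents, using the rounded m = .24 of the text.
p3 = poisson_pmf(3, 0.24)
odds_against = (1.0 - p3) / p3
print(round(p3, 4), round(odds_against))
```

The expected counts agree with the "Expected number" column of the table for 0 to 3 accidents, and the probability of three accidents comes out at the .0018 read from the Poisson table.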
V. 29. Rare Events (Accidents at Intersections). Washington, D. C.
has a total of 7,683 intersections open to traffic. During the year
1950 there were 6,211 accidents at intersections. Suppose it is
desired to know how many accidents at an intersection make it
accident prone.

The average number of accidents = 6211/7683 = .8 = m. According
to the Poisson distribution, the probabilities of accidents occurring
at an intersection are as follows:

Table V.12

Number of Accidents     Probability

        2                  .1438
        3                  .0383
        4                  .0077
        5                  .0012
    3 or more              .0474
    4 or more              .0091
    5 or more              .0014

Suppose that it is decided that when the odds are 20 to 1 that
the accidents occurring are not due to chance alone, an
intersection is to be considered accident prone. According to the table,
3 or more accidents will occur due to chance 4.74 per cent of the time.
This ratio of one to .0474 is over 20 to 1, hence an intersection
having 3 or more accidents would be considered unduly hazardous.
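
The "or more" probabilities and the resulting threshold can be checked with a short calculation. This sketch assumes that the 20 to 1 criterion is applied as in the text, by requiring the chance probability to fall below 1/20; the function names are hypothetical.

```python
import math


def poisson_tail(k, m):
    """Probability of k or more events when the average number is m."""
    below = sum(math.exp(-m) * m ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def accident_prone_threshold(m, chance_limit=1.0 / 20.0):
    """Smallest number of accidents whose chance probability is under the limit."""
    k = 0
    while poisson_tail(k, m) >= chance_limit:
        k += 1
    return k


m = 0.8
for k in (3, 4, 5):
    print(k, "or more:", round(poisson_tail(k, m), 4))
print("threshold:", accident_prone_threshold(m))
```

A stricter criterion, say 100 to 1, can be tried by passing chance_limit=0.01, which moves the threshold to a larger number of accidents.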

Records are not available as to the distribution of intersections
having fewer than 5 accidents, but of those with five or more it is
possible to compare the actual occurrence of accidents with the
number expected to occur according to the Poisson distribution.
See Table V.13.

This procedure is presented to illustrate a method of approach
and not as a suggested analysis, for obviously the records should
be much more complete. Clearly the volume of traffic is one of the
most important factors, if not the most important.
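
The expected numbers of intersections in Table V.13 can be reproduced from the difference of two "or more" probabilities. The sketch below is illustrative only: it assumes m = 8.8, the rounded value of 3638/412 for the 412 intersections listed in the table, and it computes the cumulative probabilities directly rather than reading them from a printed table.

```python
import math

TOTAL_INTERSECTIONS = 412
M = 8.8   # ASSUMPTION: rounded average, 3638 accidents / 412 intersections


def poisson_tail(k, m):
    """Probability of k or more events when the average number is m."""
    below = sum(math.exp(-m) * m ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


def expected_intersections(k):
    """Intersections expected to have exactly k accidents by chance alone."""
    exactly_k = poisson_tail(k, M) - poisson_tail(k + 1, M)
    return TOTAL_INTERSECTIONS * exactly_k


for k in range(5, 20):
    print(k, round(expected_intersections(k), 1))
```

For 5 accidents the two cumulative probabilities are about .938 and .872, and their difference multiplied by 412 gives about 27 intersections.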

Table V.13. Number of Intersections in Washington,
D.C. at Which 5 or More Accidents Occurred in 1950

Number of           Number of           Total Number     Number of Intersections
Intersections       Accidents           of Accidents     Expected to have Number of
having Accidents    Per Intersection                     Accidents Shown in Col. 2

      85                  5                  425               27
      68                  6                  408               40
      76                  7                  532               60
      55                  8                  440               55
      22                  9                  198               54
      32                 10                  320               47
      12                 11                  132               38
      10                 12                  120               28
       7                 13                   91               19
       5                 14                   70               12
       9                 15                  135                7
       4                 16                   64                4
       4                 17                   68                2
       4                 18                   72                1
       3                 19                   57           Less than 1
       5                 20                  100
       2                 21                   42
       1                 22                   22
       1                 23                   23
       1                 27                   27
       1                 28                   28
       1                 32                   32
       1                 37                   37
       1                 45                   45
       1                 64                   64
       1                 86                   86

     412                                    3638

Note: In this case, m = 3638/412 = 8.8. The last column, Number of Intersections Expected to
have Number of Accidents shown in Column 2, can be obtained by multiplying the probabilities
of occurrence taken directly from "Poisson Exponential Binomial Limits," by 412, the total
number of intersections. It may also be obtained from Appendix Table No. VI, page 226. This
table gives the probability of x or more events occurring during a given interval, when m, the
average number of events per interval, is known. In using Table VI, the probability that x,
a specific number of events, will occur is equal to the difference between the probabilities of
x or more and (x + 1) or more events occurring. In the above table, the pure chance probability
of 5 accidents occurring at an intersection is the difference in probability of 5 or more and 6 or
more accidents occurring. Multiplying this difference by the total number of intersections gives
the number of intersections expected to have 5 accidents. Referring again to Table VI, 0.872
(the probability that 6 or more accidents will take place) subtracted from 0.938 (the probability
that 5 or more accidents will take place) leaves 0.066 or 6.6%. Multiplying 412 by 6.6% gives
27, the number of intersections that may be expected to have 5 accidents.

V. 30. Size of Sample to Determine Average Number of Car Passengers.
In making a traffic survey it is required to know the average
number of persons per car. The problem is to determine the size
of sample to give a 95 per cent assurance that the mean value will
not be in error more than 0.1.

Suppose that the following typical occupancy count has been made:

Occupants (x)     Number of Observations (f)

      1                     15
      2                     10
      3                      4
      4                      2
      5                      1

Mean = X̄ = 1.9              N = 32

The standard deviation s is first calculated and found to be 1.054.
From formula IV.7.3,

(N - 1)/t² = s²/e² = (1.054)²/(.1)² = 1.11/.01 = 111

From Appendix Table III, Ratio of Degrees of Freedom to (t)², we
find that with a probability level of 5 per cent (95 per cent
assurance), for N - 1 = 400, s²/e² = 103.069 and for N - 1 = 500,
s²/e² = 128.836. Since 111 lies between these two values we
conclude that the size of sample required is between 400 and 500,
and if we wish to be conservative we take the higher value. Also
it would have been better to have taken a larger (preliminary)
sample to obtain the trial standard deviation.
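
The arithmetic of this example can be sketched as follows. Two points are assumptions made for the illustration and are not part of the text: the standard deviation is computed with N as the divisor, which comes closest to the 1.054 quoted, and the required N - 1 is located between the two tabulated ratios by straight-line interpolation, whereas the text simply takes the higher tabulated value.

```python
import math

# Occupancy count from the text: occupants (x) -> number of observations (f).
COUNTS = {1: 15, 2: 10, 3: 4, 4: 2, 5: 1}

n_obs = sum(COUNTS.values())
mean = sum(x * f for x, f in COUNTS.items()) / n_obs
# ASSUMPTION: divisor N (not N - 1), which reproduces the quoted s most closely.
variance = sum(f * (x - mean) ** 2 for x, f in COUNTS.items()) / n_obs
s = math.sqrt(variance)

e = 0.1                      # allowable error in the mean
ratio = variance / e ** 2    # s^2 / e^2, which must equal (N - 1) / t^2

# Ratios quoted in the text for the 5 per cent probability level.
LOW_DF, LOW_RATIO = 400, 103.069
HIGH_DF, HIGH_RATIO = 500, 128.836

# ASSUMPTION: straight-line interpolation between the two tabulated entries.
fraction = (ratio - LOW_RATIO) / (HIGH_RATIO - LOW_RATIO)
df_needed = LOW_DF + (HIGH_DF - LOW_DF) * fraction

print(n_obs, round(mean, 3), round(s, 3), round(ratio, 1), math.ceil(df_needed) + 1)
```

The interpolated figure falls inside the 400 to 500 range stated above; taking 500 remains the conservative choice.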

V. 31. Size of Sample Required in Speed Study. It is desired to know
the average speed on each block within one mile per hour on a
street with 60 intersections. It is also desired that there be a 95
per cent assurance as to the result. It is assumed that the speed
will vary with the volume of traffic, the weather, the amount of
parking, and perhaps other conditions. The problem is to find the
required size of sample and, having determined this, to
recommend a method of making the observations that will yield a truly
random sample.
The logical procedure is to take a random sample of about
100 observations in order to obtain an estimated standard
deviation to be used in determining the size of sample. Suppose from
this sample that it is found that the speed range is from 5 to
40 miles per hour and that the standard deviation, s, equals
4.5 miles per hour.

We use the t-distribution to find the size of sample. From
formula IV.7.3,

(N - 1)/t² = s²/e²

we find the ratio of N - 1 to t² by inserting the values for s and e.
The standard deviation s in the present example, as found from
the preliminary sample, is 4.5 miles per hour and the allowable
error is one mile per hour.

Hence, (N - 1)/t² = s²/e² = (4.5)²/1² = 20.25/1 = 20.25

From a table of ratio of degrees of freedom to t² we find that
with a probability level of 5 per cent a ratio of (N - 1)/t² = 20.202
corresponds to N - 1 = 80 and 22.727 corresponds to N - 1 = 90.
Therefore, we conclude that N, the size of sample, lies between 81
and 91. To be on the safe side, we may say that a sample of 100
observations will give us at least a 95 per cent assurance that
the average speed will be obtained within ± 1 mile per hour. If
a 99 per cent assurance is desired the size of sample according to
the table would be between 100 and 200.
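
As a rough check on the table lookup, the same relation N - 1 = t² s²/e² can be solved with the normal deviate in place of t. This is an approximation introduced here for illustration and not the method of the text: because the normal deviate is slightly smaller than t for samples of this size, it gives a somewhat smaller N than the t-based table. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_normal(s, e, assurance):
    """Approximate sample size from N - 1 = (z * s / e)^2.

    ASSUMPTION: the two-tailed normal deviate z replaces t, which is a
    reasonable stand-in only when the resulting N is fairly large.
    """
    z = NormalDist().inv_cdf(0.5 + assurance / 2.0)
    return math.ceil(1 + (z * s / e) ** 2)


print(sample_size_normal(4.5, 1.0, 0.95))
print(sample_size_normal(4.5, 1.0, 0.99))
```

For the 99 per cent case the approximation falls within the range of 100 to 200 given by the table.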

The  next  phase  of  the  problem  is  that  of  getting  a  truly  random 
sample.  Obviously  taking  all  the  speeds  on  a  day  of  light  traffic 
would  give  a  biased  result.  Clearly  there  must  be  some  knowledge 
of  the  relative  duration  of  the  various  conditions  that  influence 
speeds.  Increasing  the  size  of  the  sample  so  that  observations  might 
be  distributed  over  a  greater  number  of  hours  of  the  day,  more 
days  of  the  week  and  more  months  of  the  year  would  assure  a 
better  estimate  of  the  speed.  Increasing  the  size  of  sample  to  200 
should  give  sufficient  coverage. 
Since the speed is desired for each block it is necessary that
observations be taken in each block. Some accurate mechanical
device that is free from human errors is always preferable. This,
however, would require either 60 recording devices or a rotation
of a lesser number. Since they would give "spot" checks they
would also need to be rotated to different positions in the blocks.

Another way would be to have an observer's car "float" with
the traffic. The observer, as well as recording speed, could also note
pertinent information such as the amount of parking. Manual
recording could be supplemented or replaced by some mechanical
device such as taking a picture of the conditions in each block and
including in the picture a clock to show the time of reaching each
intersection. The cost of such pictures taken on 16 mm film would
be negligible.

The particular method to be employed in this or any other
problem involving the collection and analysis of data should be
selected by the engineer in charge after he has made a preliminary
study of both the nature of the data and the reliability and cost of
the various possible methods of conducting the field study.
Statistics is merely an aid to the engineer and not a substitute for
experience and judgment.
APPENDIX Table I
Areas Under the Normal Probability Curve

From the Mean to Distances x/σ from the Mean, Expressed as Decimal
Fractions of the Total Area 1.0000

The proportional part of the curve included between an ordinate erected
at the mean and an ordinate erected at any given value on the X axis can
be read from the table by determining x (the deviation of the given value
from the mean) and computing x/σ. Thus if X̄ = $25.00, σ = $4.00, and
it is desired to ascertain the proportion of the area under the curve between
ordinates erected at the mean and at $20.00; x = $5.00 and
x/σ = $5.00/$4.00 = 1.25. From the table it is found that .3944, or
39.44 per cent, of the entire area is included.
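
The entries of this table can also be computed rather than looked up. The sketch below uses the error function of the Python standard library, through the identity that the area from the mean to x/σ equals one half of erf((x/σ)/√2). The function name is an assumption made for illustration.

```python
import math


def area_from_mean(x_over_sigma):
    """Area under the normal curve between the mean and x/sigma."""
    return 0.5 * math.erf(abs(x_over_sigma) / math.sqrt(2.0))


# The example of the text: mean $25.00, sigma $4.00, ordinate at $20.00.
deviation = (20.00 - 25.00) / 4.00
print(round(area_from_mean(deviation), 4))
```

The absolute value is taken because the curve is symmetrical, so a deviation below the mean encloses the same area as an equal deviation above it.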


x/σ    .00     .01     .02     .03     .04     .05     .06     .07     .08     .09

0.0   .0000   .0040   .0080   .0120   .0159   .0199   .0239   .0279   .0319   .0359
0.1   .0398   .0438   .0478   .0517   .0557   .0596   .0636   .0675   .0714   .0753
0.2   .0793   .0832   .0871   .0910   .0948   .0987   .1026   .1064   .1103   .1141
0.3   .1179   .1217   .1255   .1293   .1331   .1368   .1406   .1443   .1480   .1517
0.4   .1554   .1591   .1628   .1664   .1700   .1736   .1772   .1808   .1844   .1879
0.5   .1915   .1950   .1985   .2019   .2054   .2088   .2123   .2157   .2190   .2224
0.6   .2257   .2291   .2324   .2357   .2389   .2422   .2454   .2486   .2518   .2549
0.7   .2580   .2612   .2642   .2673   .2704   .2734   .2764   .2794   .2823   .2852
0.8   .2881   .2910   .2939   .2967   .2995   .3023   .3051   .3078   .3106   .3133
0.9   .3159   .3186   .3212   .3238   .3264   .3289   .3315   .3340   .3365   .3389
1.0   .3413   .3438   .3461   .3485   .3508   .3531   .3554   .3577   .3599   .3621
1.1   .3643   .3665   .3686   .3718   .3729   .3749   .3770   .3790   .3810   .3830
1.2   .3849   .3869   .3888   .3907   .3925   .3944   .3962   .3980   .3997   .4015
1.3   .4032   .4049   .4066   .4083   .4099   .4115   .4131   .4147   .4162   .4177
1.4   .4192   .4207   .4222   .4236   .4251   .4265   .4279   .4292   .4306   .4319
1.5   .4332   .4345   .4357   .4370   .4382   .4394   .4406   .4418   .4430   .4441
1.6   .4452   .4463   .4474   .4485   .4495   .4505   .4515   .4525   .4535   .4545
1.7   .4554   .4564   .4573   .4582   .4591   .4599   .4608   .4616   .4625   .4633
1.8   .4641   .4649   .4656   .4664   .4671   .4678   .4686   .4693   .4699   .4706
1.9   .4713   .4719   .4726   .4732   .4738   .4744   .4750   .4758   .4762   .4767
2.0   .4773   .4778   .4783   .4788   .4793   .4798   .4803   .4808   .4812   .4817
2.1   .4821   .4826   .4830   .4834   .4838   .4842   .4846   .4850   .4854   .4857
2.2   .4861   .4865   .4868   .4871   .4875   .4878   .4881   .4884   .4887   .4890
2.3   .4893   .4896   .4898   .4901   .4904   .4906   .4909   .4911   .4913   .4916
2.4   .4918   .4920   .4922   .4925   .4927   .4929   .4931   .4932   .4934   .4936
2.5   .4938   .4940   .4941   .4943   .4945   .4946   .4948   .4949   .4951   .4952
2.6   .4953   .4955   .4956   .4957   .4959   .4960   .4961   .4962   .4963   .4964
2.7   .4965   .4966   .4967   .4968   .4969   .4970   .4971   .4972   .4973   .4974
2.8   .4974   .4975   .4976   .4977   .4977   .4978   .4979   .4980   .4980   .4981
2.9   .4981   .4982   .4983   .4984   .4984   .4984   .4985   .4985   .4986   .4986
3.0   .49865  .4987   .4987   .4988   .4988   .4988   .4989   .4989   .4989   .4990
3.1   .49903  .4991   .4991   .4991   .4992   .4992   .4992   .4992   .4993   .4993
3.2   .4993129
3.3   .4995166
3.4   .4996631
3.5   .4997674
3.6   .4998409
3.7   .4998922
3.8   .4999277
3.9   .4999519
4.0   .4999683
4.5   .4999966
5.0   .4999997133

Used by permission of Houghton Mifflin Company, publishers of Rugg's "Statistical Methods
Applied to Education".
APPENDIX Table II
Table of Values of t
For Given Degrees of Freedom (n) and at Specified Levels of Significance (P)

In the use of this table it is to be remembered that a level of significance
refers to both tails of the distribution. Thus, the .02 level (P = .02)
includes .01 of the area of the curve in each tail. It is to be observed that
this table is set up in a different form from the table of normal curve areas,
Appendix Table I. The table of normal curve areas showed values of x/σ in the
margins and proportionate areas from X̄ to x/σ (one direction only) in the
body. A tail of the normal distribution is obtained by subtracting this
value from .5000. Doubling the resulting figure yields the level of
significance. The t table, on the other hand, shows n (degrees of freedom) in
the stub, t in the body, and P (the level of significance) in the caption.
The last row of the t table, for n = ∞, shows t values as obtained from
the normal curve.
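
The relation between the two tables described above can be sketched in a few lines. The function names are assumptions made here for illustration; NormalDist is part of the Python standard library (module statistics).

```python
from statistics import NormalDist


def level_from_area(area_mean_to_x):
    """Level of significance P from an area read in Appendix Table I."""
    tail = 0.5 - area_mean_to_x     # one tail of the normal curve
    return 2.0 * tail               # both tails together


def t_for_infinite_df(level):
    """Value of t in the last row of the t table for level of significance P."""
    return NormalDist().inv_cdf(1.0 - level / 2.0)


print(round(level_from_area(0.4750), 4))
for p in (0.1, 0.05, 0.01, 0.001):
    print(p, round(t_for_infinite_df(p), 3))
```

An area of .4750 in Table I (x/σ = 1.96) thus corresponds to P = .05, and the computed deviates agree with the last row of the t table.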


                                   Level of Significance (P)

  n     .9     .8     .7     .6     .5     .4     .3     .2     .1     .05     .02     .01     .001

  1    .158   .325   .510   .727  1.000  1.376  1.963  3.078  6.314  12.706  31.821  63.657  636.619
  2    .142   .289   .445   .617   .816  1.061  1.386  1.886  2.920   4.303   6.965   9.925   31.598
  3    .137   .277   .424   .584   .765   .978  1.250  1.638  2.353   3.182   4.541   5.841   12.941
  4    .134   .271   .414   .569   .741   .941  1.190  1.533  2.132   2.776   3.747   4.604    8.610
  5    .132   .267   .408   .559   .727   .920  1.156  1.476  2.015   2.571   3.365   4.032    6.859
  6    .131   .265   .404   .553   .718   .906  1.134  1.440  1.943   2.447   3.143   3.707    5.959
  7    .130   .263   .402   .549   .711   .896  1.119  1.415  1.895   2.365   2.998   3.499    5.405
  8    .130   .262   .399   .546   .706   .889  1.108  1.397  1.860   2.306   2.896   3.355    5.041
  9    .129   .261   .398   .543   .703   .883  1.100  1.383  1.833   2.262   2.821   3.250    4.781
 10    .129   .260   .397   .542   .700   .879  1.093  1.372  1.812   2.228   2.764   3.169    4.587
 11    .129   .260   .396   .540   .697   .876  1.088  1.363  1.796   2.201   2.718   3.106    4.437
 12    .128   .259   .395   .539   .695   .873  1.083  1.356  1.782   2.179   2.681   3.055    4.318
 13    .128   .259   .394   .538   .694   .870  1.079  1.350  1.771   2.160   2.650   3.012    4.221
 14    .128   .258   .393   .537   .692   .868  1.076  1.345  1.761   2.145   2.624   2.977    4.140
 15    .128   .258   .393   .536   .691   .866  1.074  1.341  1.753   2.131   2.602   2.947    4.073
 16    .128   .258   .392   .535   .690   .865  1.071  1.337  1.746   2.120   2.583   2.921    4.015
 17    .128   .257   .392   .534   .689   .863  1.069  1.333  1.740   2.110   2.567   2.898    3.965
 18    .127   .257   .392   .534   .688   .862  1.067  1.330  1.734   2.101   2.552   2.878    3.922
 19    .127   .257   .391   .533   .688   .861  1.066  1.328  1.729   2.093   2.539   2.861    3.883
 20    .127   .257   .391   .533   .687   .860  1.064  1.325  1.725   2.086   2.528   2.845    3.850
 21    .127   .257   .391   .532   .686   .859  1.063  1.323  1.721   2.080   2.518   2.831    3.819
 22    .127   .256   .390   .532   .686   .858  1.061  1.321  1.717   2.074   2.508   2.819    3.792
 23    .127   .256   .390   .532   .685   .858  1.060  1.319  1.714   2.069   2.500   2.807    3.767
 24    .127   .256   .390   .531   .685   .857  1.059  1.318  1.711   2.064   2.492   2.797    3.745
 25    .127   .256   .390   .531   .684   .856  1.058  1.316  1.708   2.060   2.485   2.787    3.725
 26    .127   .256   .390   .531   .684   .856  1.058  1.315  1.706   2.056   2.479   2.779    3.707
 27    .127   .256   .389   .531   .684   .855  1.057  1.314  1.703   2.052   2.473   2.771    3.690
 28    .127   .256   .389   .530   .683   .855  1.056  1.313  1.701   2.048   2.467   2.763    3.674
 29    .127   .256   .389   .530   .683   .854  1.055  1.311  1.699   2.045   2.462   2.756    3.659
 30    .127   .256   .389   .530   .683   .854  1.055  1.310  1.697   2.042   2.457   2.750    3.646
 40    .126   .255   .388   .529   .681   .851  1.050  1.303  1.684   2.021   2.423   2.704    3.551
 60    .126   .254   .387   .527   .679   .848  1.046  1.296  1.671   2.000   2.390   2.660    3.460
120    .126   .254   .386   .526   .677   .845  1.041  1.289  1.658   1.980   2.358   2.617    3.373
  ∞    .126   .253   .385   .524   .674   .842  1.036  1.282  1.645   1.960   2.326   2.576    3.291

Appendix Table II is reprinted from Fisher and Yates: "Statistical Tables for Biological,
Agricultural, and Medical Research", published by Oliver and Boyd, Ltd., Edinburgh, by
permission of the authors and publishers.
Table III

Ratio of Degrees of Freedom to (t)²

Degrees of               Probability Level
Freedom             5%          2%          1%

    1              0.006       0.001       0.0002
    2              0.108       0.041       0.020
    3              0.296       0.145       0.088
    4              0.519       0.285       0.189
    5              0.756       0.442       0.308
    6              1.002       0.607       0.437
    7              1.252       0.778       0.572
    8              1.504       0.954       0.711
    9              1.759       1.131       0.852
   10              2.015       1.309       0.996
   11              2.271       1.489       1.140
   12              2.527       1.670       1.286
   13              2.786       1.851       1.433
   14              3.043       2.033       1.580
   15              3.303       2.216       1.727
   16              3.560       2.398       1.875
   17              3.818       2.580       2.024
   18              4.078       2.764       2.173
   19              4.337       2.947       2.321
   20              4.596       3.130       2.471
   21              4.854       3.312       2.620
   22              5.115       3.498       2.768
   23              5.373       3.680       2.919
   24              5.634       3.865       3.068
   25              5.891       4.048       3.219
   26              6.151       4.231       3.367
   27              6.412       4.415       3.516
   28              6.676       4.601       3.668
   29              6.934       4.784       3.818
   30              7.195       4.969       3.967
   40              9.803       6.813       5.447
   60             15.000      10.504       8.480
  120             30.596      21.582      17.523
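
The entries of Table III appear to be the degrees of freedom divided by the square of the corresponding t of Appendix Table II. A minimal sketch of that relation follows, using a few 5 per cent values of t copied from Table II; the names are chosen here for illustration, and entries computed from three-decimal values of t may differ in the last place from the printed ratios.

```python
# Values of t at the 5 per cent level, copied from Appendix Table II.
T_5_PERCENT = {1: 12.706, 10: 2.228, 20: 2.086, 30: 2.042, 60: 2.000}


def ratio_df_to_t_squared(df, t):
    """Ratio of degrees of freedom to t squared, as tabulated in Table III."""
    return df / t ** 2


for df, t in T_5_PERCENT.items():
    print(df, round(ratio_df_to_t_squared(df, t), 3))
```

Read in the other direction, a required ratio s²/e² is located in the body of Table III and the degrees of freedom N - 1 are read from the stub, as in the sample-size examples of Chapter V.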
APPENDIX Table IV

[Table body not legible in this copy; only the following column of values survives.]

10.827
13.815
16.268
18.465
20.517

22.457
24.322
26.125
27.877
29.588

31.264
32.909
34.528
36.123
37.697

39.252
40.790
42.312
43.820
45.315

46.797
48.268

51.179

©  rH  oo©  eo 
Th  Th  Th  m  i> 
q  q  q  ©  Th 
eo'  Th  Th  m  © 

CO  Th  ©  00© 

©  rH  CO  ©  © 

q  rn  ©  i>-  m 

i>  oo  oo  ©  © 

,_,   M    rH    rH   CO 

q 

CO 
OS 

co  coco  i-i  m 
©  om  m  th 

OHCONH 

in  t«-  eo  in  © 

eo  ©  eo  co  -# 

COHNMOS 

m  co  co  co  co 

in  ©  co  m  i-h 

J>C0  ©  t>  © 

moiaOiOOi 
Th  in  in©  i> 

CO  CO  ©  I>M 

©  n  ©  i-h  m 

©  ©  CO  rH  00 

i>o6©'©  d 

M00  rH  00  rH 

©  eo  ©  Th  i-h 
lonqooo 
i-i  co"  co  co  Th 

©  rH  00  00  CO 

t>  m  co  ©  © 
q  i-h  ©  i>  Th 
ifidcNoo 

> 
O 

M 

o 

00 

q 

00 
C0^ 

§  ©  in©  co 

o  <*  coco  m 

OOH^N 

Th  Th  CO  CO  © 

co  ©  co  eo  in 
i-h  in  ©  q  © 
M  rH  csi  co  CO 

©  com  com 

ONCOCOOO 

©  i-h  i>  q  © 
eo  ■<*  Th  m  m 

■>*  m©  i>  t> 

i-h  m  ©  ©  co 
©  co  q  mq 
dt>  t>  coos' 

m  ©  eo  co  t> 

rH  ©  ©   ©  © 

©  ©  CO  ©  © 

©  ©  rH  rH  CO 

©  m  r-  Th© 

©  CO  Th  t>  © 

co  Th  Th  m  © 

© 

q 

m 

OOiONrH 

©  ©  i-h  q  q 

CO  ©©  00  00 

t>  eo  Th  oo  m 

oo  q  q  ©  q 

'mi-hcoco 

MHNOO 

m  i>  ©  ©  co 

©  q  i-h  ©  q 
coco  Th"  ■*  m 

co  oo  meo  © 

rH  ©  rH  eo  © 

qTh©  ©q 
m©  t>i>o6 

t>C0©©  Th 

©  Th  ©  m  co 

COOS©©  rH 

co  ©  m©  eo 
©  t>-  ©  m  m 
rn  q  m  q  © 

CO  CO  CO  Th  Th 

For large values of n, compute $\sqrt{2\chi^2}$, the distribution of which
is approximately normal around a mean of $\sqrt{2n - 1}$ with $\sigma = 1$.
P is the ratio of one tail of the normal distribution to the area under
the entire curve.
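
The large-n rule in the note above can be sketched in a few lines. This is a minimal illustration, not part of the original tables: the function name is hypothetical, and the value 43.773 (the commonly tabulated 5 per cent point of $\chi^2$ for n = 30) is assumed only as a trial input.

```python
import math


def chi2_tail_normal_approx(chi2: float, n: int) -> float:
    """Approximate P for chi-square with n degrees of freedom (large n).

    Treats sqrt(2 * chi2) as normal with mean sqrt(2n - 1) and sigma = 1,
    and returns the area of one tail beyond the observed value.
    """
    z = math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n - 1.0)
    # Upper-tail area of the unit normal curve beyond z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Trial input (assumed): the 5 per cent point usually tabulated for n = 30.
print(round(chi2_tail_normal_approx(43.773, 30), 3))
```

The result should fall close to, though not exactly at, 0.05, since the rule is only an approximation.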

A detailed table of the probability of various values of $\chi^2$ for one
degree of freedom is given in G. U. Yule and M. G. Kendall, An Introduction
to the Theory of Statistics, 11th edition, pp. 534-535, Charles Griffin and
Co., London, 1937.

Appendix Table IV is reprinted from Fisher and Yates: "Statistical Tables for
Biological, Agricultural, and Medical Research", published by Oliver and Boyd,
Ltd., Edinburgh, by permission of the authors and publishers.
APPENDIX
Figures I and II

[Two charts. Vertical axis: relative height of ordinate. Horizontal axis:
value of $\chi^2$.]


Distribution of $\chi^2$ for n = 1, n = 5, n = 9, and n = 17. The maximum
ordinate is at $\chi^2 = n - 2$ except when n = 1. When n = 1, the maximum
ordinate is at $\chi^2 = 0$. When n = 1, there is 4.55 per cent of the
curve beyond $\chi^2 = 4$. Beyond $\chi^2 = 30$ there is .0015 of one per
cent of the curve when n = 5; .0439 of one per cent of the curve when n = 9;
2.6345 per cent of the curve when n = 17. The two charts have been
drawn to different scales. If the vertical axis of the upper chart is
expanded to approximately 20 times its length and the horizontal axis is
contracted to about one-eighth of its length, the curves will be roughly
comparable as to area.
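
The tail areas quoted in the caption can be checked numerically. The sketch below is an illustration only: it uses the standard closed-form sums for whole-number degrees of freedom (a textbook result, not taken from this appendix), and the function name is hypothetical.

```python
import math


def chi2_tail(chi2: float, n: int) -> float:
    """Fraction of the chi-square curve (n degrees of freedom) beyond chi2."""
    half = chi2 / 2.0
    if n % 2 == 0:
        # Even n: exp(-chi2/2) times a finite sum of (chi2/2)^j / j!.
        term, total = 1.0, 1.0
        for j in range(1, n // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd n: both tails of the unit normal beyond sqrt(chi2), plus a
    # finite series x/1 + x^3/(1*3) + x^5/(1*3*5) + ... times 2*density.
    x = math.sqrt(chi2)
    two_tails = math.erfc(x / math.sqrt(2.0))
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = x, 0.0
    for r in range(1, (n - 1) // 2 + 1):
        total += term
        term *= chi2 / (2 * r + 1)
    return two_tails + 2.0 * density * total


# Tail areas, in per cent, for the four cases named in the caption.
for n, value in ((1, 4.0), (5, 30.0), (9, 30.0), (17, 30.0)):
    print(n, value, 100.0 * chi2_tail(value, n))
```

The printed percentages can be compared directly with the 4.55, .0015, .0439, and 2.6345 per cent figures given in the caption.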




APPENDIX Table V

5% and 1% Points for the Distribution of F
(For each value of n2 the upper line gives the 5% point and the lower line
the 1% point; these are set in Roman type and bold face type, respectively,
in the original.)

n1 = degrees of freedom (for greater mean square), read across; n2 read down.

  n2      1      2      3      4      5      6      7      8      9     10     11     12
   1    161    200    216    225    230    234    237    239    241    242    243    244
      4,052  4,999  5,403  5,625  5,764  5,859  5,928  5,981  6,022  6,056  6,082  6,106
   2  18.51  19.00  19.16  19.25  19.30  19.33  19.36  19.37  19.38  19.39  19.40  19.41
      98.49  99.00  99.17  99.25  99.30  99.33  99.34  99.36  99.38  99.40  99.41  99.42
   3  10.13   9.55   9.28   9.12   9.01   8.94   8.88   8.84   8.81   8.78   8.76   8.74
      34.12  30.82  29.46  28.71  28.24  27.91  27.67  27.49  27.34  27.23  27.13  27.05
   4   7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00   5.96   5.93   5.91
      21.20  18.00  16.69  15.98  15.52  15.21  14.98  14.80  14.66  14.54  14.45  14.37
   5   6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.78   4.74   4.70   4.68
      16.26  13.27  12.06  11.39  10.97  10.67  10.45  10.27  10.15  10.05   9.96   9.89
   6   5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06   4.03   4.00
      13.74  10.92   9.78   9.15   8.75   8.47   8.26   8.10   7.98   7.87   7.79   7.72
   7   5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.63   3.60   3.57
      12.25   9.55   8.45   7.85   7.46   7.19   7.00   6.84   6.71   6.62   6.54   6.47
   8   5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.34   3.31   3.28
      11.26   8.65   7.59   7.01   6.63   6.37   6.19   6.03   5.91   5.82   5.74   5.67
   9   5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.13   3.10   3.07
      10.56   8.02   6.99   6.42   6.06   5.80   5.62   5.47   5.35   5.26   5.18   5.11
  10   4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.97   2.94   2.91
      10.04   7.56   6.55   5.99   5.64   5.39   5.21   5.06   4.95   4.85   4.78   4.71
  11   4.84   3.98   3.59   3.36   3.20   3.09   3.01   2.95   2.90   2.86   2.82   2.79
       9.65   7.20   6.22   5.67   5.32   5.07   4.88   4.74   4.63   4.54   4.46   4.40
  12   4.75   3.88   3.49   3.26   3.11   3.00   2.92   2.85   2.80   2.76   2.72   2.69
       9.33   6.93   5.95   5.41   5.06   4.82   4.65   4.50   4.39   4.30   4.22   4.16
  13   4.67   3.80   3.41   3.18   3.02   2.92   2.84   2.77   2.72   2.67   2.63   2.60
       9.07   6.70   5.74   5.20   4.86   4.62   4.44   4.30   4.19   4.10   4.02   3.96
  14   4.60   3.74   3.34   3.11   2.96   2.85   2.77   2.70   2.65   2.60   2.56   2.53
       8.86   6.51   5.56   5.03   4.69   4.46   4.28   4.14   4.03   3.94   3.86   3.80
  15   4.54   3.68   3.29   3.06   2.90   2.79   2.70   2.64   2.59   2.55   2.51   2.48
       8.68   6.36   5.42   4.89   4.56   4.32   4.14   4.00   3.89   3.80   3.73   3.67
  16   4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.54   2.49   2.45   2.42
       8.53   6.23   5.29   4.77   4.44   4.20   4.03   3.89   3.78   3.69   3.61   3.55
  17   4.45   3.59   3.20   2.96   2.81   2.70   2.62   2.55   2.50   2.45   2.41   2.38
       8.40   6.11   5.18   4.67   4.34   4.10   3.93   3.79   3.68   3.59   3.52   3.45
  18   4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.46   2.41   2.37   2.34
       8.28   6.01   5.09   4.58   4.25   4.01   3.85   3.71   3.60   3.51   3.44   3.37
  19   4.38   3.52   3.13   2.90   2.74   2.63   2.55   2.48   2.43   2.38   2.34   2.31
       8.18   5.93   5.01   4.50   4.17   3.94   3.77   3.63   3.52   3.43   3.36   3.30
  20   4.35   3.49   3.10   2.87   2.71   2.60   2.52   2.45   2.40   2.35   2.31   2.28
       8.10   5.85   4.94   4.43   4.10   3.87   3.71   3.56   3.45   3.37   3.30   3.23
  21   4.32   3.47   3.07   2.84   2.68   2.57   2.49   2.42   2.37   2.32   2.28   2.25
       8.02   5.78   4.87   4.37   4.04   3.81   3.65   3.51   3.40   3.31   3.24   3.17
  22   4.30   3.44   3.05   2.82   2.66   2.55   2.47   2.40   2.35   2.30   2.26   2.23
       7.94   5.72   4.82   4.31   3.99   3.76   3.59   3.45   3.35   3.26   3.18   3.12
  23   4.28   3.42   3.03   2.80   2.64   2.53   2.45   2.38   2.32   2.28   2.24   2.20
       7.88   5.66   4.76   4.26   3.94   3.71   3.54   3.41   3.30   3.21   3.14   3.07
  24   4.26   3.40   3.01   2.78   2.62   2.51   2.43   2.36   2.30   2.26   2.22   2.18
       7.82   5.61   4.72   4.22   3.90   3.67   3.50   3.36   3.25   3.17   3.09   3.03
  25   4.24   3.38   2.99   2.76   2.60   2.49   2.41   2.34   2.28   2.24   2.20   2.16
       7.77   5.57   4.68   4.18   3.86   3.63   3.46   3.32   3.21   3.13   3.05   2.99
  26   4.22   3.37   2.98   2.74   2.59   2.47   2.39   2.32   2.27   2.22   2.18   2.15
       7.72   5.53   4.64   4.14   3.82   3.59   3.42   3.29   3.17   3.09   3.02   2.96

  n2     14     16     20     24     30     40     50     75    100    200    500      ∞
   1    245    246    248    249    250    251    252    253    253    254    254    254
      6,142  6,169  6,208  6,234  6,258  6,286  6,302  6,323  6,334  6,352  6,361  6,366
   2  19.42  19.43  19.44  19.45  19.46  19.47  19.47  19.48  19.49  19.49  19.50  19.50
      99.43  99.44  99.45  99.46  99.47  99.48  99.48  99.49  99.49  99.49  99.50  99.50
   3   8.71   8.69   8.66   8.64   8.62   8.60   8.58   8.57   8.56   8.54   8.54   8.53
      26.92  26.83  26.69  26.60  26.50  26.41  26.35  26.27  26.23  26.18  26.14  26.12
   4   5.87   5.84   5.80   5.77   5.74   5.71   5.70   5.68   5.66   5.65   5.64   5.63
      14.24  14.15  14.02  13.93  13.83  13.74  13.69  13.61  13.57  13.52  13.48  13.46
   5   4.64   4.60   4.56   4.53   4.50   4.46   4.44   4.42   4.40   4.38   4.37   4.36
       9.77   9.68   9.55   9.47   9.38   9.29   9.24   9.17   9.13   9.07   9.04   9.02
   6   3.96   3.92   3.87   3.84   3.81   3.77   3.75   3.72   3.71   3.69   3.68   3.67
       7.60   7.52   7.39   7.31   7.23   7.14   7.09   7.02   6.99   6.94   6.90   6.88
   7   3.52   3.49   3.44   3.41   3.38   3.34   3.32   3.29   3.28   3.25   3.24   3.23
       6.35   6.27   6.15   6.07   5.98   5.90   5.85   5.78   5.75   5.70   5.67   5.65
   8   3.23   3.20   3.15   3.12   3.08   3.05   3.03   3.00   2.98   2.96   2.94   2.93
       5.56   5.48   5.36   5.28   5.20   5.11   5.06   5.00   4.96   4.91   4.88   4.86
   9   3.02   2.98   2.93   2.90   2.86   2.82   2.80   2.77   2.76   2.73   2.72   2.71
       5.00   4.92   4.80   4.73   4.64   4.56   4.51   4.45   4.41   4.36   4.33   4.31
  10   2.86   2.82   2.77   2.74   2.70   2.67   2.64   2.61   2.59   2.56   2.55   2.54
       4.60   4.52   4.41   4.33   4.25   4.17   4.12   4.05   4.01   3.96   3.93   3.91
  11   2.74   2.70   2.65   2.61   2.57   2.53   2.50   2.47   2.45   2.42   2.41   2.40
       4.29   4.21   4.10   4.02   3.94   3.86   3.80   3.74   3.70   3.66   3.62   3.60
  12   2.64   2.60   2.54   2.50   2.46   2.42   2.40   2.36   2.35   2.32   2.31   2.30
       4.05   3.98   3.86   3.78   3.70   3.61   3.56   3.49   3.46   3.41   3.38   3.36
  13   2.55   2.51   2.46   2.42   2.38   2.34   2.32   2.28   2.26   2.24   2.22   2.21
       3.85   3.78   3.67   3.59   3.51   3.42   3.37   3.30   3.27   3.21   3.18   3.16
  14   2.48   2.44   2.39   2.35   2.31   2.27   2.24   2.21   2.19   2.16   2.14   2.13
       3.70   3.62   3.51   3.43   3.34   3.26   3.21   3.14   3.11   3.06   3.02   3.00
  15   2.43   2.39   2.33   2.29   2.25   2.21   2.18   2.15   2.12   2.10   2.08   2.07
       3.56   3.48   3.36   3.29   3.20   3.12   3.07   3.00   2.97   2.92   2.89   2.87
  16   2.37   2.33   2.28   2.24   2.20   2.16   2.13   2.09   2.07   2.04   2.02   2.01
       3.45   3.37   3.25   3.18   3.10   3.01   2.96   2.89   2.86   2.80   2.77   2.75
  17   2.33   2.29   2.23   2.19   2.15   2.11   2.08   2.04   2.02   1.99   1.97   1.96
       3.35   3.27   3.16   3.08   3.00   2.92   2.86   2.79   2.76   2.70   2.67   2.65
  18   2.29   2.25   2.19   2.15   2.11   2.07   2.04   2.00   1.98   1.95   1.93   1.92
       3.27   3.19   3.07   3.00   2.91   2.83   2.78   2.71   2.68   2.62   2.59   2.57
  19   2.26   2.21   2.15   2.11   2.07   2.02   2.00   1.96   1.94   1.91   1.90   1.88
       3.19   3.12   3.00   2.92   2.84   2.76   2.70   2.63   2.60   2.54   2.51   2.49
  20   2.23   2.18   2.12   2.08   2.04   1.99   1.96   1.92   1.90   1.87   1.85   1.84
       3.13   3.05   2.94   2.86   2.77   2.69   2.63   2.56   2.53   2.47   2.44   2.42
  21   2.20   2.15   2.09   2.05   2.00   1.96   1.93   1.89   1.87   1.84   1.82   1.81
       3.07   2.99   2.88   2.80   2.72   2.63   2.58   2.51   2.47   2.42   2.38   2.36
  22   2.18   2.13   2.07   2.03   1.98   1.93   1.91   1.87   1.84   1.81   1.80   1.78
       3.02   2.94   2.83   2.75   2.67   2.58   2.53   2.46   2.42   2.37   2.33   2.31
  23   2.14   2.10   2.04   2.00   1.96   1.91   1.88   1.84   1.82   1.79   1.77   1.76
       2.97   2.89   2.78   2.70   2.62   2.53   2.48   2.41   2.37   2.32   2.28   2.26
  24   2.13   2.09   2.02   1.98   1.94   1.89   1.86   1.82   1.80   1.76   1.74   1.73
       2.93   2.85   2.74   2.66   2.58   2.49   2.44   2.36   2.33   2.27   2.23   2.21
  25   2.11   2.06   2.00   1.96   1.92   1.87   1.84   1.80   1.77   1.74   1.72   1.71
       2.89   2.81   2.70   2.62   2.54   2.45   2.40   2.32   2.29   2.23   2.19   2.17
  26   2.10   2.05   1.99   1.95   1.90   1.85   1.82   1.78   1.76   1.72   1.70   1.69
       2.86   2.77   2.66   2.58   2.50   2.41   2.36   2.28   2.25   2.19   2.15   2.13

The function, F = e with exponent 2z, is computed in part from Fisher's table
VI (7). Additional entries are by interpolation, mostly graphical.

Used by permission of Iowa State College Press, publishers of Snedecor's
"Statistical Methods", 4th Edition.
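
Entries in the n1 = 2 column of Table V can be spot-checked without further tables. The sketch below rests on a standard result that is not stated in this appendix: when n1 = 2, the area of the F curve beyond F is (1 + 2F/n2) raised to the power -n2/2, which can be solved directly for the 5% and 1% points. The function name is hypothetical.

```python
def f_point_n1_2(alpha: float, n2: float) -> float:
    """F value exceeded with probability alpha when n1 = 2 and n2 is finite."""
    # Tail area for n1 = 2 is (1 + 2F/n2) ** (-n2/2); solving for F gives:
    return (n2 / 2.0) * (alpha ** (-2.0 / n2) - 1.0)


# Print the 5% and 1% points for a few values of n2, for comparison
# with the n1 = 2 column of the table.
for n2 in (2, 10, 20):
    print(n2, round(f_point_n1_2(0.05, n2), 2), round(f_point_n1_2(0.01, n2), 2))
```

The shortcut applies to the n1 = 2 column only; the other columns require the full F distribution.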
APPENDIX Table V (Continued)

5% and 1% Points for the Distribution of F
(For each value of n2 the upper line gives the 5% point and the lower line
the 1% point; these are set in Roman type and bold face type, respectively,
in the original.)

n1 = degrees of freedom (for greater mean square), read across; n2 read down.

  n2      1      2      3      4      5      6      7      8      9     10     11     12
  27   4.21   3.35   2.96   2.73   2.57   2.46   2.37   2.30   2.25   2.20   2.16   2.13
       7.68   5.49   4.60   4.11   3.79   3.56   3.39   3.26   3.14   3.06   2.98   2.93
  28   4.20   3.34   2.95   2.71   2.56   2.44   2.36   2.29   2.24   2.19   2.15   2.12
       7.64   5.45   4.57   4.07   3.76   3.53   3.36   3.23   3.11   3.03   2.95   2.90
  29   4.18   3.33   2.93   2.70   2.54   2.43   2.35   2.28   2.22   2.18   2.14   2.10
       7.60   5.42   4.54   4.04   3.73   3.50   3.33   3.20   3.08   3.00   2.92   2.87
  30   4.17   3.32   2.92   2.69   2.53   2.42   2.34   2.27   2.21   2.16   2.12   2.09
       7.56   5.39   4.51   4.02   3.70   3.47   3.30   3.17   3.06   2.98   2.90   2.84
  32   4.15   3.30   2.90   2.67   2.51   2.40   2.32   2.25   2.19   2.14   2.10   2.07
       7.50   5.34   4.46   3.97   3.66   3.42   3.25   3.12   3.01   2.94   2.86   2.80
  34   4.13   3.28   2.88   2.65   2.49   2.38   2.30   2.23   2.17   2.12   2.08   2.05
       7.44   5.29   4.42   3.93   3.61   3.38   3.21   3.08   2.97   2.89   2.82   2.76
  36   4.11   3.26   2.86   2.63   2.48   2.36   2.28   2.21   2.15   2.10   2.06   2.03
       7.39   5.25   4.38   3.89   3.58   3.35   3.18   3.04   2.94   2.86   2.78   2.72
  38   4.10   3.25   2.85   2.62   2.46   2.35   2.26   2.19   2.14   2.09   2.05   2.02
       7.35   5.21   4.34   3.86   3.54   3.32   3.15   3.02   2.91   2.82   2.75   2.69
  40   4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.12   2.07   2.04   2.00
       7.31   5.18   4.31   3.83   3.51   3.29   3.12   2.99   2.88   2.80   2.73   2.66
  42   4.07   3.22   2.83   2.59   2.44   2.32   2.24   2.17   2.11   2.06   2.02   1.99
       7.27   5.15   4.29   3.80   3.49   3.26   3.10   2.96   2.86   2.77   2.70   2.64
  44   4.06   3.21   2.82   2.58   2.43   2.31   2.23   2.16   2.10   2.05   2.01   1.98
       7.24   5.12   4.26   3.78   3.46   3.24   3.07   2.94   2.84   2.75   2.68   2.62
  46   4.05   3.20   2.81   2.57   2.42   2.30   2.22   2.14   2.09   2.04   2.00   1.97
       7.21   5.10   4.24   3.76   3.44   3.22   3.05   2.92   2.82   2.73   2.66   2.60
  48   4.04   3.19   2.80   2.56   2.41   2.30   2.21   2.14   2.08   2.03   1.99   1.96
       7.19   5.08   4.22   3.74   3.42   3.20   3.04   2.90   2.80   2.71   2.64   2.58
  50   4.03   3.18   2.79   2.56   2.40   2.29   2.20   2.13   2.07   2.02   1.98   1.95
       7.17   5.06   4.20   3.72   3.41   3.18   3.02   2.88   2.78   2.70   2.62   2.56
  55   4.02   3.17   2.78   2.54   2.38   2.27   2.18   2.11   2.05   2.00   1.97   1.93
       7.12   5.01   4.16   3.68   3.37   3.15   2.98   2.85   2.75   2.66   2.59   2.53
  60   4.00   3.15   2.76   2.52   2.37   2.25   2.17   2.10   2.04   1.99   1.95   1.92
       7.08   4.98   4.13   3.65   3.34   3.12   2.95   2.82   2.72   2.63   2.56   2.50
  65   3.99   3.14   2.75   2.51   2.36   2.24   2.15   2.08   2.02   1.98   1.94   1.90
       7.04   4.95   4.10   3.62   3.31   3.09   2.93   2.79   2.70   2.61   2.54   2.47
  70   3.98   3.13   2.74   2.50   2.35   2.23   2.14   2.07   2.01   1.97   1.93   1.89
       7.01   4.92   4.08   3.60   3.29   3.07   2.91   2.77   2.67   2.59   2.51   2.45
  80   3.96   3.11   2.72   2.48   2.33   2.21   2.12   2.05   1.99   1.95   1.91   1.88
       6.96   4.88   4.04   3.56   3.25   3.04   2.87   2.74   2.64   2.55   2.48   2.41
 100   3.94   3.09   2.70   2.46   2.30   2.19   2.10   2.03   1.97   1.92   1.88   1.85
       6.90   4.82   3.98   3.51   3.20   2.99   2.82   2.69   2.59   2.51   2.43   2.36
 125   3.92   3.07   2.68   2.44   2.29   2.17   2.08   2.01   1.95   1.90   1.86   1.83
       6.84   4.78   3.94   3.47   3.17   2.95   2.79   2.65   2.56   2.47   2.40   2.33
 150   3.91   3.06   2.67   2.43   2.27   2.16   2.07   2.00   1.94   1.89   1.85   1.82
       6.81   4.75   3.91   3.44   3.14   2.92   2.76   2.62   2.53   2.44   2.37   2.30
 200   3.89   3.04   2.65   2.41   2.26   2.14   2.05   1.98   1.92   1.87   1.83   1.80
       6.76   4.71   3.88   3.41   3.11   2.90   2.73   2.60   2.50   2.41   2.34   2.28
 400   3.86   3.02   2.62   2.39   2.23   2.12   2.03   1.96   1.90   1.85   1.81   1.78
       6.70   4.66   3.83   3.36   3.06   2.85   2.69   2.55   2.46   2.37   2.29   2.23
1000   3.85   3.00   2.61   2.38   2.22   2.10   2.02   1.95   1.89   1.84   1.80   1.76
       6.66   4.62   3.80   3.34   3.04   2.82   2.66   2.53   2.43   2.34   2.26   2.20
   ∞   3.84   2.99   2.60   2.37   2.21   2.09   2.01   1.94   1.88   1.83   1.79   1.75
       6.64   4.60   3.78   3.32   3.02   2.80   2.64   2.51   2.41   2.32   2.24   2.18

  n2     14     16     20     24     30     40     50     75    100    200    500      ∞
  27   2.08   2.03   1.97   1.93   1.88   1.84   1.80   1.76   1.74   1.71   1.68   1.67
       2.83   2.74   2.63   2.55   2.47   2.38   2.33   2.25   2.21   2.16   2.12   2.10
  28   2.06   2.02   1.96   1.91   1.87   1.81   1.78   1.75   1.72   1.69   1.67   1.65
       2.80   2.71   2.60   2.52   2.44   2.35   2.30   2.22   2.18   2.13   2.09   2.06
  29   2.05   2.00   1.94   1.90   1.85   1.80   1.77   1.73   1.71   1.68   1.65   1.64
       2.77   2.68   2.57   2.49   2.41   2.32   2.27   2.19   2.15   2.10   2.06   2.03
  30   2.04   1.99   1.93   1.89   1.84   1.79   1.76   1.72   1.69   1.66   1.64   1.62
       2.74   2.66   2.55   2.47   2.38   2.29   2.24   2.16   2.13   2.07   2.03   2.01
  32   2.02   1.97   1.91   1.86   1.82   1.76   1.74   1.69   1.67   1.64   1.61   1.59
       2.70   2.62   2.51   2.42   2.34   2.25   2.20   2.12   2.08   2.02   1.98   1.96
  34   2.00   1.95   1.89   1.84   1.80   1.74   1.71   1.67   1.64   1.61   1.59   1.57
       2.66   2.58   2.47   2.38   2.30   2.21   2.15   2.08   2.04   1.98   1.94   1.91
  36   1.98   1.93   1.87   1.82   1.78   1.72   1.69   1.65   1.62   1.59   1.56   1.55
       2.62   2.54   2.43   2.35   2.26   2.17   2.12   2.04   2.00   1.94   1.90   1.87
  38   1.96   1.92   1.85   1.80   1.76   1.71   1.67   1.63   1.60   1.57   1.54   1.53
       2.59   2.51   2.40   2.32   2.22   2.14   2.08   2.00   1.97   1.90   1.86   1.84
  40   1.95   1.90   1.84   1.79   1.74   1.69   1.66   1.61   1.59   1.55   1.53   1.51
       2.56   2.49   2.37   2.29   2.20   2.11   2.05   1.97   1.94   1.88   1.84   1.81
  42   1.94   1.89   1.82   1.78   1.73   1.68   1.64   1.60   1.57   1.54   1.51   1.49
       2.54   2.46   2.35   2.26   2.17   2.08   2.02   1.94   1.91   1.85   1.80   1.78
  44   1.92   1.88   1.81   1.76   1.72   1.66   1.63   1.58   1.56   1.52   1.50   1.48
       2.52   2.44   2.32   2.24   2.15   2.06   2.00   1.92   1.88   1.82   1.78   1.75
  46   1.91   1.87   1.80   1.75   1.71   1.65   1.62   1.57   1.54   1.51   1.48   1.46
       2.50   2.42   2.30   2.22   2.13   2.04   1.98   1.90   1.86   1.80   1.76   1.72
  48   1.90   1.86   1.79   1.74   1.70   1.64   1.61   1.56   1.53   1.50   1.47   1.45
       2.48   2.40   2.28   2.20   2.11   2.02   1.96   1.88   1.84   1.78   1.73   1.70
  50   1.90   1.85   1.78   1.74   1.69   1.63   1.60   1.55   1.52   1.48   1.46   1.44
       2.46   2.39   2.26   2.18   2.10   2.00   1.94   1.86   1.82   1.76   1.71   1.68
  55   1.88   1.83   1.76   1.72   1.67   1.61   1.58   1.52   1.50   1.46   1.43   1.41
       2.43   2.35   2.23   2.15   2.06   1.96   1.90   1.82   1.78   1.71   1.66   1.64
  60   1.86   1.81   1.75   1.70   1.65   1.59   1.56   1.50   1.48   1.44   1.41   1.39
       2.40   2.32   2.20   2.12   2.03   1.92   1.87   1.79   1.74   1.68   1.63   1.60
  65   1.85   1.80   1.73   1.68   1.63   1.57   1.54   1.49   1.46   1.42   1.39   1.37
       2.37   2.30   2.18   2.09   2.00   1.90   1.84   1.76   1.71   1.64   1.60   1.56
  70   1.84   1.79   1.72   1.67   1.62   1.56   1.53   1.47   1.45   1.40   1.37   1.35
       2.35   2.28   2.15   2.07   1.98   1.88   1.82   1.74   1.69   1.62   1.56   1.53
  80   1.82   1.77   1.70   1.65   1.60   1.54   1.51   1.45   1.42   1.38   1.35   1.32
       2.32   2.24   2.11   2.03   1.94   1.84   1.78   1.70   1.65   1.57   1.52   1.49
 100   1.79   1.75   1.68   1.63   1.57   1.51   1.48   1.42   1.39   1.34   1.30   1.28
       2.26   2.19   2.06   1.98   1.89   1.79   1.73   1.64   1.59   1.51   1.46   1.43
 125   1.77   1.72   1.65   1.60   1.55   1.49   1.45   1.39   1.36   1.31   1.27   1.25
       2.23   2.15   2.03   1.94   1.85   1.75   1.68   1.59   1.54   1.46   1.40   1.37
 150   1.76   1.71   1.64   1.59   1.54   1.47   1.44   1.37   1.34   1.29   1.25   1.22
       2.20   2.12   2.00   1.91   1.83   1.72   1.66   1.56   1.51   1.43   1.37   1.33
 200   1.74   1.69   1.62   1.57   1.52   1.45   1.42   1.35   1.32   1.26   1.22   1.19
       2.17   2.09   1.97   1.88   1.79   1.69   1.62   1.53   1.48   1.39   1.33   1.28
 400   1.72   1.67   1.60   1.54   1.49   1.42   1.38   1.32   1.28   1.22   1.16   1.13
       2.12   2.04   1.92   1.84   1.74   1.64   1.57   1.47   1.42   1.32   1.24   1.19
1000   1.70   1.65   1.58   1.53   1.47   1.41   1.36   1.30   1.26   1.19   1.13   1.08
       2.09   2.01   1.89   1.81   1.71   1.61   1.54   1.44   1.38   1.28   1.19   1.11
   ∞   1.69   1.64   1.57   1.52   1.46   1.40   1.35   1.28   1.24   1.17   1.11   1.00
       2.07   1.99   1.87   1.79   1.69   1.59   1.52   1.41   1.36   1.25   1.15   1.00
APPENDIX Table VI
Poisson Tables

Construction of the Table Giving the Probability of x or More Events
Happening in a Given Interval if 'm', the Average Number of Events
per Interval, is Known

The probability that 'x' events will happen in a given time or space
segment is equal to

$$P_n = \frac{e^{-m}(m^x)}{x!}$$

where x refers to any value of 'n'.

The value of this expression for various values of 'm' and 'x' is
readily available in standard Poisson tables.

Thus $P_n$ may be found for any given values of 'x' and 'm'. For
example, if m = 4 and x = 0,

$$P_0 = \frac{e^{-m}(m^x)}{x!} = \frac{e^{-4}(4^0)}{0!} = 0.018$$

m = 4 and x = 1:

$$P_1 = \frac{e^{-m}(m^x)}{x!} = \frac{e^{-4}(4^1)}{1!} = \frac{0.0183\,(4)}{1} = 0.073$$

m = 4 and x = 2:

$$P_2 = \frac{e^{-4}(4^2)}{2!} = \frac{0.0183\,(16)}{2} = 0.147$$

m = 4 and x = 3:

$$P_3 = \frac{e^{-4}(4^3)}{3!} = \frac{0.0183\,(64)}{6} = 0.195$$

This procedure can, of course, be continued.

The probability of getting three or less is the sum of the probabilities
of getting 0, 1, 2 or 3 and therefore is equal to 0.018 + 0.073
+ 0.147 + 0.195 = 0.433, or 43.3 in 100, or 43.3 per cent. The
probability of getting four or more is 56.7 out of 100 or 56.7 per
cent. This follows from the fact that the total probability of getting
all possible numbers is one or 100 per cent. This is the procedure
followed in the calculation of the tables. Therefore, the values
given in the tables are

$$1 - e^{-m}\left(\frac{m^0}{0!} + \frac{m^1}{1!} + \frac{m^2}{2!} + \cdots + \frac{m^{(x-1)}}{(x-1)!}\right)$$
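
The construction described above can be sketched as a short program. This is an illustration under stated assumptions rather than the procedure actually used to prepare the tables: the function names are hypothetical, and the only inputs are the worked values m = 4 and x = 0, 1, 2, 3 from the text.

```python
import math


def poisson_probability(x: int, m: float) -> float:
    """Probability that exactly x events happen when the average is m."""
    return math.exp(-m) * m ** x / math.factorial(x)


def probability_x_or_more(x: int, m: float) -> float:
    """Probability of x or more events: one minus the sum for 0 .. x - 1."""
    return 1.0 - sum(poisson_probability(k, m) for k in range(x))


m = 4
for x in range(4):
    # Individual terms P_0 .. P_3 of the worked example.
    print(x, round(poisson_probability(x, m), 3))

# Probability of four or more events, as entered in the table for m = 4.
print(round(probability_x_or_more(4, m), 3))
```

Summing the terms for 0 to x - 1 and subtracting from one follows the same order of operations as the worked example for m = 4.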
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defined 22 

desirable  properties  of 58 

Averages 

moving 17 
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Bernoulli's  theorem 65,  66 

Bienayme-Tchebycheff  criterion 70 
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Determinants,  evaluation  of      134 
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of  arrays,  standard      116 

standard 45 

Dispersion  and  Variance 97 

Distribution 
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elements  of 61 

experimental 63 
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hypergeometric,  example 105 

interpretation  of  the  properties  of  normal 88 
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arithmetic,  properties  of 59 

arithmetic,  size  of  sample  for 145 

average  deviation 51 

contra harmonic 51

geometric 42,  60 

harmonic 44,  60 

population,  inference  concerning 141 
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Minimum  spacing  formula,  interpretation  of 154 
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Moments  of  a  Distribution 54 
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Pearson,  Karl 55 
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Population  mean,  inference  concerning 141 
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